
how do i do this in sed? (or any tool for that matter)



i need to get the path and file from a website. i have already retrieved the line in sed with this

cat xuf.htm | sed -n /SRC=.*xuf.*.gif/p

which gives this

<a href="http://ars.userfriendly.org/cartoons/?id=20030904"><img ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005904.gif"></a>

 

how do i process this to get only /cartoons/archives/03sep/xuf005904.gif?

 

thanks.

 

hey garu, look at me, im learning regexp. :#:

 

ciao!

Link to comment
Share on other sites

Hey ramfree17

 

This should give aru a good laugh, but I can almost get what you want using cut.

 

[booyah@yoyoyo booyah]$ cat ramfree | cut -d / -f8-12 | cut -d'"' -f1

cartoons/archives/03sep/xuf005904.gif

 

I want to figure out aru's update scripts, so I've been playing around with bash. Thought I'd give it a shot and see what the guru says.
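
For anyone curious what the two cut calls are doing, here is a rough Python sketch of the same field splitting. The sample line is the one from the first post; the variable names are only for illustration. Like the cut version, it depends on how many slashes come before the image URL, so it would break if the page layout changed.

```python
# Sample line from the first post (the HTML that sed printed)
line = ('<a href="http://ars.userfriendly.org/cartoons/?id=20030904">'
        '<img ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 '
        'SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005904.gif"></a>')

# cut -d / -f8-12: split on "/" and keep fields 8 to 12 (cut counts from 1)
fields = line.split("/")
middle = "/".join(fields[7:12])

# cut -d'"' -f1: keep everything before the first double quote
path = middle.split('"')[0]

print(path)
```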

Link to comment
Share on other sites

such an easy solution. the reason why i didnt think of it was because i didnt think gnuwin32 had ported the cut utility to windows (yes im doing this from work).

 

i have read the package page again and it is indeed ported, under the textutils package.

 

now i only have to coax wget to output its downloaded page to sed and im done. :#:

 

ciao!

Link to comment
Share on other sites

wget -O blah http://www.userfriendly.org && cat blah | sed -n /SRC=.*xuf.*.gif/p

 

I don't know if this is what you want but i can't think how to redirect wget to standard output

 

Booyah had a better idea, but i think lynx -source will work better, because the output of lynx -dump is formatted and so has no SRC attributes left in it

[john@bob Desktop]$ lynx -source http://www.userfriendly.org | sed -n /SRC=.*xuf.*.gif/p

<A href="http://ars.userfriendly.org/cartoons/?id=20030905"><IMG ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005905.gif"></A>

[john@bob Desktop]$
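
In case it helps to see the filtering step on its own, this is a small Python sketch of what sed -n /pattern/p does: print only the lines where the pattern matches. The three-line page text is a made-up stand-in for the real download, and the dot before gif is escaped here so that it matches a literal ".".

```python
import re

# Hypothetical stand-in for the downloaded page; only the second line has the strip
page = '''<IMG SRC="http://www.userfriendly.org/images/logo.gif">
<IMG ALT="Latest Strip" SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005905.gif">
<p>some other text</p>'''

# Same pattern as the sed filter, with the dot before "gif" escaped
pattern = re.compile(r'SRC=.*xuf.*\.gif')

# Keep only the lines where the pattern matches somewhere in the line
matching = [text for text in page.splitlines() if pattern.search(text)]
for text in matching:
    print(text)
```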

Link to comment
Share on other sites

wget -O blah http://www.userfriendly.org && cat blah | sed -n /SRC=.*xuf.*.gif/p

too complicated ;) you can do it with just one pipe.

 

Ramfree try this:

 

~$ wget -O - http://your.url.org | sed -n 's@.*SRC.*://.[^/]*\(.*\)".*@\1@p'

 

Explanation (it works, I've tried it):

Use @ (or anything else) instead of / as the delimiter because of the URL format: that way there is no need to escape the slashes inside the URI.

Match the whole line (see the .* at both the beginning and the end of the regex).

Target what you want: SRC.*://.[^/]*\(.*\)" ; the subexpression between \( and \) is the selected part. The unselected part runs from SRC up to the first slash (/) after the http://, in other words everything after http:// that is not a slash (the host name). The selected part is everything from that first / (which was excluded from the unselected part) up to the " ; then the WHOLE line is substituted with the selected part (\1) and the result is printed.
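
To watch the capture group at work outside the shell, here is a minimal Python sketch of the same substitution. It assumes the sample line that lynx -source printed earlier in the thread; Python writes the group as ( ) where sed's basic regex syntax writes it as \( \).

```python
import re

# Sample line as printed by lynx -source earlier in the thread
line = ('<A href="http://ars.userfriendly.org/cartoons/?id=20030905">'
        '<IMG ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 '
        'SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005905.gif"></A>')

# Same pattern as the sed one-liner: skip up to the host name, capture the path
pattern = re.compile(r'.*SRC.*://.[^/]*(.*)".*')

# Replace the whole line with group 1, like the \1 in the sed replacement
result = pattern.sub(r'\1', line)
print(result)
```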

 

 

If I have more time later this evening I'll do it better (this was written in a hurry), and maybe with awk and/or with python too :D

 

HTH

Link to comment
Share on other sites

with python:

 

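#!/usr/bin/env python3
import re
import sys
import urllib.request

url = "http://your.url.org"

try:
    # Fetch the page; network errors (URLError) are subclasses of OSError
    html_text = urllib.request.urlopen(url).read().decode("latin-1")
except OSError:
    sys.exit(1)

# Skip the scheme and host name, then capture the path up to the closing quote
item_re = re.compile(r'SRC="?\w+://[^/"]+(/[^"]*)"', re.I)

# Print every image path found in the page, one per line
for item in item_re.findall(html_text):
    print(item)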

 

Edited: typo removed

Link to comment
Share on other sites

f*?king job!!! I'm missing a lot of fun these days, no time to visit this board, no time to play with this thread, and what is worse, no time to play with linux!!!!!!! :cry:

 

If I could only have some more time I'd work a little more with these regexps to make them simpler and clearer, but no way... once again... I have to go... see you in a couple of days :?

Link to comment
Share on other sites

thanks garu. i knew your twisted mind will deliver something par excellant (sic). :#:

 

now i have to munch on this one. im just learning and the garu has fed me something more than i can bite. :#:

 

Ramfree try this:

 

~$ wget -O - http://your.url.org | sed -n 's@.*SRC.*://.[^/]*\(.*\)".*@\1@p'

 

and he did this while in a hurry.. wow!

 

ciao!

Link to comment
Share on other sites

It wouldn't be so bad if he'd quit sticking his tongue out at us when he does things like that :lol:

 

:mystilol:

 

i have adapted his one-liner but i still dont understand it. :(

 

[edit] spoke too soon. it works in cygwin but the line

 type page.hml | sed -n 's@.*href=*"\(.*x86.exe\)".*@\1@p'

produces the error

sed: -e expression #1, char 1: Unknown command: `''

 

am i asking too much for help on this one? or is this a problem of the windows port of sed?
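
A guess at the cause, not a confirmed diagnosis: the Windows command prompt does not treat single quotes as quoting characters, so sed would see the ' as the first character of its script, which fits the "char 1" in the error message. Putting the script in a file and running sed with -f is one thing to try. Another way around shell quoting altogether is to do the match in a small Python script. The sketch below assumes a made-up sample line and file name (setup-x86.exe on example.com), since the real page is not shown in the thread.

```python
import re

# Hypothetical sample line; the real page.hml is not shown in the thread
line = '<a href="http://example.com/downloads/setup-x86.exe">Download</a>'

# Same idea as the sed script: capture the quoted href value ending in x86.exe
pattern = re.compile(r'href="([^"]*x86\.exe)"')

match = pattern.search(line)
if match:
    # group(1) is the part between the parentheses, like \1 in sed
    print(match.group(1))
```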

 

ciao!

Link to comment
Share on other sites
