ramfree17 Posted September 4, 2003

i need to get the path and file name from a web page. i have already retrieved the line with sed using this:

    cat xuf.htm | sed -n /SRC=.*xuf.*.gif/p

which gives this:

    <a href="http://ars.userfriendly.org/cartoons/?id=20030904"><img ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005904.gif"></a>

how do i process this to get only /cartoons/archives/03sep/xuf005904.gif?

thanks. hey garu, look at me, im learning regexp. :#:

ciao!
Guest BooYah Posted September 5, 2003

Hey ramfree17

This should give aru a good laugh, but I can almost get what you want using cut.

    [booyah@yoyoyo booyah]$ cat ramfree | cut -d / -f8-12 | cut -d'"' -f1
    cartoons/archives/03sep/xuf005904.gif

I want to figure out aru's update scripts, so I've been playing around with bash. Thought I'd give it a shot and see what the guru says.
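To see why the cut pipeline lands on the right fields, here is a minimal Python sketch of the same idea: split on "/", keep fields 8 through 12, then keep everything before the first double quote. The helper names `cut_fields` and `extract_path` are made up for illustration, and the sample line is the one from the first post. Like the cut version, it drops the leading slash.

```python
# Sample line copied from the first post in this thread.
LINE = ('<a href="http://ars.userfriendly.org/cartoons/?id=20030904">'
        '<img ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 '
        'SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005904.gif"></a>')


def cut_fields(text, delimiter, first, last):
    """Rough stand-in for `cut -d DELIM -f FIRST-LAST` (1-based, inclusive)."""
    fields = text.split(delimiter)
    return delimiter.join(fields[first - 1:last])


def extract_path(line):
    """Mirror `cut -d / -f8-12 | cut -d'"' -f1` on a single line."""
    tail = cut_fields(line, "/", 8, 12)   # path plus trailing HTML
    return cut_fields(tail, '"', 1, 1)    # keep what precedes the first quote


if __name__ == "__main__":
    print(extract_path(LINE))
```

The field numbers are tied to this exact line layout: if the page adds or removes a slash before the SRC attribute, fields 8 through 12 will point somewhere else.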
ramfree17 Posted September 5, 2003

such an easy solution. the reason why i didn't think of it was because i didn't think gnuwin32 had ported the cut utility to windows (yes im doing this from work). i have read the package page again and it is indeed ported under the textutils package.

now i only have to coax wget to output its downloaded page to sed and im done. :#:

ciao!
Guest BooYah Posted September 5, 2003

What about:

    lynx --dump

I'm still waiting for aru to jump in here and show us how it's done in 10 keystrokes or less.
johnnyv Posted September 5, 2003

    wget -O blah http://www.userfriendly.org && cat blah | sed -n /SRC=.*xuf.*.gif/p

I don't know if this is what you want, but i can't think how to redirect wget to standard output.

Booyah had a better idea, but i think lynx -source will work better, since the output of lynx -dump is formatted text and no longer contains the SRC attributes.

    [john@bob Desktop]$ lynx -source http://www.userfriendly.org | sed -n /SRC=.*xuf.*.gif/p
    <A href="http://ars.userfriendly.org/cartoons/?id=20030905"><IMG ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005905.gif"></A>
    [john@bob Desktop]$
aru Posted September 6, 2003

> wget -O blah http://www.userfriendly.org && cat blah | sed -n /SRC=.*xuf.*.gif/p

too complicated ;) you can do it with just one pipe.

Ramfree try this:

    ~$ wget -O - http://your.url.org | sed -n 's@.*SRC.*://.[^/]*\(.*\)".*@\1@p'

Explanation (works, i've tried):

- Use @ (or anything else) instead of / as the delimiter because of the url format: no need to escape the slashes inside the uri.
- Match the whole line (see the .* at both the beginning and the end of the REGEX).
- Target what you want: SRC.*://.[^/]*\(.*\)" selects the subexpression between \( and \). The unselected part runs from SRC up to the first slash after http://, in other words the host name, which contains no slash. The selected part is everything from that first / (it was excluded from the unselected part) up to the ".
- Then substitute the WHOLE line with the selected part (\1) and print the result.

If I have more time later in the evening I'll do it better (this is something written in a hurry) and maybe with awk and/or with python too :D

HTH
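The substitution above can be replayed in Python to watch the capture group at work. This is only a sketch: sed's basic regex syntax writes the group as \( \) while Python writes it as ( ), and the function name `sed_like_extract` is hypothetical. The sample line is the one from the first post.

```python
import re

# Sample line copied from the first post in this thread.
LINE = ('<a href="http://ars.userfriendly.org/cartoons/?id=20030904">'
        '<img ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 '
        'SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005904.gif"></a>')

# Same pattern as the sed one-liner, with Python-style group parentheses:
#   .*SRC.*://   skip everything up to the scheme separator after SRC
#   .[^/]*       the host name (no slash inside)
#   (.*)"        group 1: from the first slash up to the last double quote
#   .*           the rest of the line
PATTERN = re.compile(r'.*SRC.*://.[^/]*(.*)".*')


def sed_like_extract(line):
    """Mimic `sed -n 's@PATTERN@\\1@p'`: return the rewritten line, or None."""
    replaced, count = PATTERN.subn(r"\1", line, count=1)
    return replaced if count else None


if __name__ == "__main__":
    for candidate in (LINE, "<p>no image here</p>"):
        result = sed_like_extract(candidate)
        if result is not None:   # like -n ... p: print only when it matched
            print(result)
```

Because the pattern starts and ends with .*, the match covers the entire line, so replacing it with group 1 leaves only the path. Lines without SRC produce no match and are skipped, which is what the -n and p combination does in sed.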
aru Posted September 7, 2003

with python:

    #!/usr/bin/env python3
    import re
    import sys
    import urllib.request

    url = "http://your.url.org"
    try:
        # Fetch the page; exit with an error status if the request fails.
        html_text = urllib.request.urlopen(url).read().decode("latin-1")
    except OSError:
        sys.exit(1)

    # Capture everything after the host name, up to the closing quote.
    # "." does not cross line ends here, so each match stays on one line.
    item_re = re.compile(r'SRC.+//.+?/(.+)"', re.I)
    for item in item_re.findall(html_text):
        print(item)

Edited: typo removed
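Regex matching depends on how the page happens to lay out its tags, so as an alternative, here is a hedged sketch that uses the standard library's html.parser and urllib.parse instead. It collects the path part of every img src attribute. The class name `ImgSrcCollector` and the function `image_paths` are illustrative names, not part of anything posted in this thread, and the sample HTML is the line from the first post.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Sample markup copied from the first post in this thread.
SAMPLE = ('<a href="http://ars.userfriendly.org/cartoons/?id=20030904">'
          '<img ALT="Latest Strip" height="219" WIDTH="576" BORDER=0 '
          'SRC="http://www.userfriendly.org/cartoons/archives/03sep/xuf005904.gif"></a>')


class ImgSrcCollector(HTMLParser):
    """Collect the URL path of every <img src=...> seen while parsing."""

    def __init__(self):
        super().__init__()
        self.paths = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.paths.append(urlsplit(value).path)


def image_paths(html_text):
    """Return the path component of each image URL found in html_text."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.paths


if __name__ == "__main__":
    for path in image_paths(SAMPLE):
        print(path)
```

To pick out only the comic strip, the returned list could be filtered afterwards, for example by keeping paths that contain "xuf" and end in ".gif", which is what the original sed filter looked for.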
aru Posted September 7, 2003

With awk:

    ~$ wget -O - http://your.url.org | awk '/SRC=/ {$0 = gensub("^.*SRC.*//.[^/]*/(.*)\".*", "\\1", "g", $0); print}'

Not quite sure on this one, but I guess it will work too
aru Posted September 7, 2003

f*?king job!!! I'm missing a lot of fun these days: no time to visit this board, no time to play with this thread, and what is worse, no time to play with linux!!!!!!!

If I only had some more time I'd work a little more on these regexps to make them simpler and clearer, but no way... once again... I have to go... see you in a couple of days :?
ramfree17 Posted September 11, 2003

thanks garu. i knew your twisted mind would deliver something par excellant (sic). :#:

now i have to munch on this one. im just learning and the garu has fed me something more than i can bite. :#:

> Ramfree try this:
> ~$ wget -O - http://your.url.org | sed -n 's@.*SRC.*://.[^/]*\(.*\)".*@\1@p'

and he did this while in a hurry.. wow!

ciao!
Guest BooYah Posted September 11, 2003

> the garu has fed me something more than i can bite.

It wouldn't be so bad if he'd quit sticking his tongue out at us when he does things like that :lol:
ramfree17 Posted September 15, 2003

> It wouldn't be so bad if he'd quit sticking his tongue out at us when he does things like that :lol:

:mystilol:

i have adapted his one-liner but i still don't understand it. :(

[edit] spoke too soon. it works in cygwin, but in windows the line

    type page.hml | sed -n 's@.*href=*"\(.*x86.exe\)".*@\1@p'

produces the error

    sed: -e expression #1, char 1: Unknown command: `''

am i asking too much for help on this one? or is this a problem of the windows port of sed?

ciao!
ramfree17 Posted September 15, 2003

scrap that. the windows port of sed doesn't require the enclosing quotes. how n00bie.... :banghead:

ciao!
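The "Unknown command" error at char 1 is consistent with sed receiving the single quote as the first character of its script, since the Windows command prompt does not strip single quotes the way a POSIX shell does. The sketch below uses Python's shlex module purely as an analogy: posix=True stands in for bash or cygwin, and posix=False stands in for a tokenizer that leaves quotes in place. It is not a model of how cmd.exe really parses a command line, and the sed script in it is a made-up example, not the one from the post above.

```python
import shlex

# Hypothetical command line with a single-quoted sed script.
COMMAND = "sed -n 's@a@b@p'"

# POSIX-style splitting (bash, cygwin): the quotes group the script
# into one argument and are then removed.
posix_args = shlex.split(COMMAND, posix=True)

# Non-POSIX splitting: the quotes stay in the argument, so the script
# that reaches sed would begin with a literal ' character.
literal_args = shlex.split(COMMAND, posix=False)

if __name__ == "__main__":
    print("POSIX-style :", posix_args)
    print("quotes kept :", literal_args)
```

In the second case the first character of the script is ', which is not a sed command, and that matches the error message quoted above. Dropping the single quotes on Windows removes the stray character.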