theYinYeti Posted December 16, 2002

Here's how I almost automatically retrieved all my messages from the old boards, at club-nihil.net:

Connect to the club-nihil site and follow the "My posts" link. Save the index(es) of your posts in as many files as needed. For me, that was: 1search.php.html, 2search.php.html, 3search.php.html, 4search.php.html, 5search.php.html, 6search.php.html, 7search.php.html

Open those files with a regular-expression-enabled text editor. In each of those files:
- replace the text 'http://www.club-nihil.net/mub/' with '' (nothing)
- replace the text 'http://club-nihil.net/mub/' with '' (nothing)
- replace the regular expression 'href="([^"]*)"' with '\nhref="http://www.club-nihil.net/mub/\1"\n'
I suppose you could do all that with sed too, but I'm not sure every sed understands \n in the replacement...

Save all changes. You'll obtain files with only absolute URLs, and each href alone on its own line. Then, in a shell, run this (all on one line!):

grep -h '^href="[^"]*"$' *search.php* | sed -e 's/&amp;/\&/g' -e 's/&highlight=/\&start=0/' | grep -E 'viewtopic\.php\?t=[[:digit:]]+&start=' | sort -u | sed -e 's/^href="//' -e 's/"$//' > topics.txt

The result is a text file with the URLs of all pages to save. Now you just have to get them with wget:

wget -i topics.txt -w 1 -E

In the end, wget tells you how many files were downloaded. You can check that all files are OK with this command, which should give the same number (that is, if you moved the index files somewhere else first; otherwise they'll get counted too):

grep '<html' * | wc -l

I hope this helps.
Yves.

Edit: I added in bold some code I previously forgot to type.
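To see the whole recipe working end to end, here is a self-contained sketch on made-up sample data (the three href lines and the file name are hypothetical stand-ins for what the editor step produces). The escapes in the patterns (`\&`, `\.`, `\?`) matter: without them the pipeline silently produces an empty file.

```shell
#!/bin/sh
# Sketch of the full pipeline on hypothetical sample data.
cd "$(mktemp -d)"
cat > 1search.php.html <<'EOF'
href="http://www.club-nihil.net/mub/viewtopic.php?t=7508&amp;highlight=vim"
href="http://www.club-nihil.net/mub/viewtopic.php?t=65&amp;start=15"
href="http://www.club-nihil.net/mub/profile.php?mode=viewprofile&amp;u=42"
EOF

# Undo the &amp; entities, normalize highlight= to start=0, keep only
# viewtopic URLs, dedupe, and strip the href="..." wrapper:
grep -h '^href="[^"]*"$' *search.php* \
  | sed -e 's/&amp;/\&/g' -e 's/&highlight=[^"]*/\&start=0/' \
  | grep -E 'viewtopic\.php\?t=[[:digit:]]+&start=' \
  | sort -u \
  | sed -e 's/^href="//' -e 's/"$//' > topics.txt

cat topics.txt
```

The profile link is filtered out; only the two viewtopic URLs survive, ready for `wget -i topics.txt`.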
aru Posted December 19, 2002

theYinYeti wrote:
"Here's how I almost automatically retrieved all my messages from the old boards, at club-nihil.net: Connect to the club-nihil site and follow the "My posts" link. Save the index(es) of your posts in as many files as needed. For me, that was: 1search.php.html, 2search.php.html, 3search.php.html, 4search.php.html, 5search.php.html, 6search.php.html, 7search.php.html"

EXCELLENT!!! I've enjoyed this post a lot.

OK, here's what I do to simplify the work you proposed a bit (a single command, and no editing steps beforehand) ;)

Once the search pages are downloaded, as theYinYeti said, do:

[arusabal@localhost ~]$ URL="www.club-nihil.net\/mub\/"
[arusabal@localhost ~]$ sed -n "/.*\(viewtopic.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {s//${URL}\1 => \2/p;}" < 1search.php.html

(note the escaped slashes in $URL, so it can appear in the sed replacement)

The output is:

www.club-nihil.net/mub/viewtopic.php?t=7508 => How can one make VIM work/behave like old-fashioned vi?
www.club-nihil.net/mub/viewtopic.php?t=7424 => THE BOARD IS DEATH!!!
www.club-nihil.net/mub/viewtopic.php?t=7147 => THE BOARD WILL NEVER GO DOWN!!!!
www.club-nihil.net/mub/viewtopic.php?t=7457 => weird raknk????!!!
www.club-nihil.net/mub/viewtopic.php?t=687 => Mod Log
www.club-nihil.net/mub/viewtopic.php?t=7439 => HELP! i think either CPU / MEM Died!!
www.club-nihil.net/mub/viewtopic.php?t=7070 => lying low, sort of
www.club-nihil.net/mub/viewtopic.php?t=7438 => Please help a total newbie!
www.club-nihil.net/mub/viewtopic.php?t=65 => Some Mandrake tutorials for newbies
...
[arusabal@localhost ~]$

As you see, I've beautified the output a bit, just to show the thread name next to each URL. Redirect the output to a file, for example topics.txt (the same file name theYinYeti used), and do it for each of the search.php.html pages you saved (remember to use '>>' to append each output to the previously saved ones). Now you have a list of all the threads you ever posted in.
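The same extraction can be checked on a made-up one-line fragment of a phpBB search page (the HTML line below is hypothetical). This variant prepends the site prefix in a second sed using a `|` delimiter, which sidesteps escaping the slashes in $URL:

```shell
#!/bin/sh
# Hypothetical one-line fragment of a phpBB search results page:
cd "$(mktemp -d)"
cat > 1search.php.html <<'EOF'
<a href="viewtopic.php?t=7508&amp;highlight=" class="topictitle">How can one make VIM work?</a><br />
EOF

URL="www.club-nihil.net/mub/"
# Capture the topic URL and the title text, then prepend the site prefix:
out=$(sed -n 's/.*\(viewtopic.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/\1 => \2/p' 1search.php.html \
      | sed "s|^|${URL}|")
echo "$out"
```

The `topictitle..` part skips the `">` between the class attribute and the title text, exactly as in aru's command.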
theYinYeti wrote:
"The result is a text file with the URLs of all pages to save. Now you just have to get them with wget: wget -i topics.txt -w 1 -E"

Instead of doing it that way (my output also has the name of each thread), do this:

[arusabal@localhost ~]$ while read line; do
>   url=${line%% *}   # the url is everything up to the first blank space
>   wget -w 1 -E ${url}
> done < topics.txt

theYinYeti wrote:
"I hope this helps. Yves."

Me too! I'll try to download my posts tonight. Thanks Yves, I've had a lot of fun and I have learned a couple of things :D
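The `${line%% *}` in that loop is plain POSIX parameter expansion, removing everything from the first space onward; a quick sketch with a made-up line:

```shell
#!/bin/sh
# ${line%% *} strips the longest suffix matching ' *' (a space followed by
# anything), leaving only the URL field (the line below is hypothetical):
line="www.club-nihil.net/mub/viewtopic.php?t=7508 => How can one make VIM work?"
url=${line%% *}
echo "$url"
```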
Cannonfodder Posted December 19, 2002 I think you guys have managed to slashdot club-nihil :)
aru Posted December 19, 2002 The only weird thing is that I cannot find a way to download all the posts within a topic. It seems the php variable posts_per_page is set to 15, and I can't figure out how to tell the URL that I want ALL the posts!!! I'm too tired now to think clearly. Well, at last I've saved *most* of my posts (402 threads) :P
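Since posts_per_page is fixed at 15, one untested workaround sketch is to generate the start= offsets yourself when you know (or over-estimate) a thread's length; the topic id and reply count below are invented:

```shell
#!/bin/sh
# Generate the start= offsets for a paginated phpBB thread instead of
# scraping them (hypothetical topic id and post count):
cd "$(mktemp -d)"
t=65        # hypothetical topic id
replies=50  # hypothetical number of posts in the thread
start=0
while [ "$start" -le "$replies" ]; do
  echo "www.club-nihil.net/mub/viewtopic.php?t=${t}&start=${start}"
  start=$((start + 15))
done > offsets.txt
cat offsets.txt
```

For 50 posts this yields the pages at start=0, 15, 30 and 45, which could then be fed to `wget -i offsets.txt`.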
ramfree17 Posted December 20, 2002 i wish i could too... :? ciao!
theYinYeti Posted December 20, 2002 (Author)

Cannonfodder wrote: "I think you guys have managed to slashdot club-nihil :)"
:lol:

aru wrote: "posts_per_page is set to 15 and I can't figure out how to tell the URL that I want ALL the posts!!!"
See below.

aru wrote: "EXCELLENT!!! I've enjoyed this post a lot."
Thanks to all of you :) I'm always glad to receive some feedback!

aru wrote: "The output is: www.club-nihil.net/mub/viewtopic.php?t=7508 => How can one make VIM work/behave like old-fashioned vi? ..."
Yes, that's the one thing I don't like about my solution: the topic title is not visible. But I decided I didn't care, because I'll search those saved posts with grep anyway, much as I did with the board's search link.

The problem with your solution, however, is that only the 't' parameter of each URL is kept. If you want all the posts of each thread, you have to replace every 'highlight=' with 'start=0', then keep all the '...start=...' URLs, and finally delete the duplicates. That's the only way I found to get everything: one page for posts 0-15, another for 16-30, and so on.

By the way, I made some typing errors when writing this topic. I have edited my first post and corrected them in bold.

Yves.
aru Posted December 20, 2002

Yes, you are right. So, after learning a bit more about sed and following your advice, this is what I ended up doing:

URL="www.club-nihil.net\/mub\/"
for file in *search.php.html; do
  echo -e $( sed -n '/.*\(viewtopic.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {
    s//\ntag\n# \2: tag\n\t'${URL}'\1 tag\n/;
    s/\(viewtopic.php?t=[[:alnum:]]\{1,5\}\)&\(start=[[:alnum:]]\{1,3\}\)/\n\t'${URL}'\1\&\2 tag\n/g;
    p;
  }' < ${file} ) | grep tag | grep -v 'start=0' | sed 's/tag//g'
done > TOPICS.TXT

(the "tag" thing is just a trick ;) )

Now the output is:

...
# HELP! i think either CPU / MEM Died!!:
    www.club-nihil.net/mub/viewtopic.php?t=7439
# lying low, sort of:
    www.club-nihil.net/mub/viewtopic.php?t=7070
# Please help a total newbie!:
    www.club-nihil.net/mub/viewtopic.php?t=7438
# Some Mandrake tutorials for newbies:
    www.club-nihil.net/mub/viewtopic.php?t=65
    www.club-nihil.net/mub/viewtopic.php?t=65&start=15
    www.club-nihil.net/mub/viewtopic.php?t=65&start=30
    www.club-nihil.net/mub/viewtopic.php?t=65&start=45
# deno's point of view - again - about us (IMHO both boards):
    www.club-nihil.net/mub/viewtopic.php?t=7418
# It's a nightmare; will I ever wake up? Can you help?:
    www.club-nihil.net/mub/viewtopic.php?t=7159
    www.club-nihil.net/mub/viewtopic.php?t=7159&start=15
# Paketmanagement without X:
    www.club-nihil.net/mub/viewtopic.php?t=7395
...

which is trivial to parse into a decent HTML index page. To retrieve the files, do:

[arusabal@localhost ~]$ while read url; do
>   if echo $url | grep www.club-nihil.net &> /dev/null;
>   then wget -E $url; fi;
> done < TOPICS.TXT

I noticed that the problem is still not *completely* solved, because very long threads are linked as "Goto page: 1 ... n-2, n-1, n". I know I could write a script to handle this too, but who cares... I don't remember ever posting in a *very long* thread that deserves such effort, and if there is one, it will be faster to do it by hand!
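As a sketch of that "trivial to parse" step, assuming the TOPICS.TXT layout shown above (a '# title:' line followed by indented URLs), a small awk script can emit a minimal HTML index; the two sample lines are a hypothetical extract:

```shell
#!/bin/sh
# Turn the "# title:" / indented-URL layout into a minimal HTML index.
cd "$(mktemp -d)"
cat > TOPICS.TXT <<'EOF'
# Some Mandrake tutorials for newbies:
    www.club-nihil.net/mub/viewtopic.php?t=65
    www.club-nihil.net/mub/viewtopic.php?t=65&start=15
EOF

awk '
/^#/        { sub(/^# /, ""); sub(/:$/, ""); print "<h3>" $0 "</h3>"; next }
/viewtopic/ { gsub(/^[ \t]+/, ""); print "<a href=\"http://" $0 "\">" $0 "</a><br>" }
' TOPICS.TXT > index.html

cat index.html
```

Each title becomes a heading and each saved page a link, so the index can sit next to the wget-downloaded files.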
:) Again, I had a lot of fun (and learned many things) :)
Steve Scrimpshire Posted January 13, 2003

Following YinYeti's advice, all I get after this:

grep -h '^href="[^"]*"$' *search.php* | sed -e 's/&amp;/\&/g' -e 's/&highlight=/\&start=0/' | grep -E 'viewtopic\.php\?t=[[:digit:]]+&start=' | sort -u | sed -e 's/^href="//' -e 's/"$//' > topics.txt

is a blank file named topics.txt. Can't figure out what I'm doing wrong.
aru Posted January 13, 2003

Steve Scrimpshire wrote: "Following YinYeti's advice, all I get ... is a blank file named topics.txt. Can't figure out what I'm doing wrong."

Don't follow theYinYeti's advice, follow mine :P Just be sure that no blank space gets appended at the end of the lines if you copy and paste the commands (otherwise the redirections and pipes won't work right).
Steve Scrimpshire Posted January 13, 2003

It has something to do with the syntax of this being off (or my text editor [Kate] not understanding it properly), I think:

- replace the regular expression 'href="([^"]*)"' with '\nhref="http://www.club-nihil.net/mub/\1"\n'
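For completeness, another way to end up with an empty topics.txt is losing the backslashes in the grep -E pattern: in an extended regex an unescaped '?' is a quantifier, so 'viewtopic.php?t=' never matches the literal URL text. A quick demonstration on a made-up href line:

```shell
#!/bin/sh
# Show why the '?' must be escaped in the grep -E pattern.
cd "$(mktemp -d)"
echo 'href="viewtopic.php?t=7508&start=0"' > sample.txt

grep -cE 'viewtopic.php?t='   sample.txt || true  # prints 0: '?' made the 'p' optional
grep -cE 'viewtopic\.php\?t=' sample.txt          # prints 1: '?' escaped, literal match
```

The unescaped pattern asks for "viewtopic.ph", an optional "p", then "t=" immediately after, which the real URL never contains.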
Steve Scrimpshire Posted January 13, 2003 Well, I wound up doing it mostly aru's way. But for those of you/us who want to do this and were moderators over there: wget will not retrieve the posts you/we made in the password-protected moderators' forum (which is not even visible to regular users). You probably already knew that, but I just figured it out. The pages get listed and wget attempts to download them, but it can't retrieve them.
aru Posted January 13, 2003

Yes, that is true, and I didn't care when I downloaded my posts; but both 'wget' and 'lynx -source' have options to use Netscape's cookies and thus allow auto-login (I never tried that, but I know the possibility exists).

...but who cares now; the anon told me the old board went down today :( I feel happy for having saved all the posts from the Tips&Tricks and FAQs&Howtos forums yesterday :bitter smile:
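An untested sketch of what that auto-login could look like: wget's `--load-cookies` option reads a Netscape-format cookie file, so a session cookie copied from a logged-in browser would let it fetch the protected forum. The cookie name and value below are invented:

```shell
#!/bin/sh
# Untested sketch: build a Netscape-format cookie file for wget.
# Format per line (tab-separated): domain, include-subdomains flag,
# path, secure flag, expiry, cookie name, cookie value.
cd "$(mktemp -d)"
printf 'www.club-nihil.net\tFALSE\t/mub/\tFALSE\t0\tphpbb2mysql_sid\tdeadbeef00000000\n' > cookies.txt

# With a real session cookie, a protected page could then be fetched as:
# wget --load-cookies cookies.txt -w 1 -E "http://www.club-nihil.net/mub/viewtopic.php?t=1234"
cat cookies.txt
```

The cookie value here is a placeholder; in practice the whole line would be exported from the browser that is logged in to the board.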
Steve Scrimpshire Posted January 14, 2003

aru wrote: "...but who cares now; the anon told me the old board went down today :("

It sure did. It was working earlier today. Wonder if Tom just vanished off the face of the earth. The first glimpse of a message you get while it redirects you is "Site suspended..."