HowTo retrieve your messages from club-nihil before it dies!


theYinYeti

Here's how I almost automatically retrieved all my messages from the old boards, at club-nihil.net:

Connect to the club-nihil site, and follow the "My posts" link.

 

Save the index(es) of your posts in as many files as needed. For me, that was:

1search.php.html, 2search.php.html, 3search.php.html, 4search.php.html, 5search.php.html, 6search.php.html, 7search.php.html

 

Open those files with a Regular-Expressions-enabled text-editor. For all of those files:

- replace text 'http://www.club-nihil.net/mub/' with '' (nothing)

- replace text 'http://club-nihil.net/mub/' with '' (nothing)

- replace regular expression 'href="([^"]*)"' with '\nhref="http://www.club-nihil.net/mub/\1"\n'

I suppose you can do all that with `sed` instead of an editor, but I'm not sure every sed understands \n for newlines in the replacement text... (see the sketch after the next step)

 

Save all changes. You'll obtain files containing only absolute URLs, with each href on its own line.
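If you prefer to skip the editor, roughly the same preprocessing can be done non-interactively. This is only a sketch, assuming GNU sed (for in-place editing with -i and for \n in the replacement) and that your saved index files match *search.php*:

# strip the absolute prefixes, then put every href="..." on its own line
# with the absolute URL restored (GNU sed assumed)
sed -i -e 's|http://www\.club-nihil\.net/mub/||g' \
       -e 's|http://club-nihil\.net/mub/||g' \
       -e 's|href="\([^"]*\)"|\nhref="http://www.club-nihil.net/mub/\1"\n|g' \
       *search.php*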

Then, in a shell, run this:

grep -h '^href="[^"]*"$' *search.php* | sed -e 's/&amp;/\&/g' -e 's/&highlight=/\&start=0/' | grep -E "viewtopic\.php\?t=[[:digit:]]+&start=" | sort -u | sed -e 's/^href="//' -e 's/"$//' > topics.txt

(only one line!)
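To make the pipeline easier to follow, here is roughly what one line looks like at each stage, using topic 7508 from the example output further down (the exact shape of the hrefs is an assumption based on phpBB's usual search links):

href="http://www.club-nihil.net/mub/viewtopic.php?t=7508&amp;highlight="   <- as produced by the editor step
href="http://www.club-nihil.net/mub/viewtopic.php?t=7508&start=0"          <- after the two sed expressions
http://www.club-nihil.net/mub/viewtopic.php?t=7508&start=0                 <- final line in topics.txt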

 

The result is a text file with the URLs of all pages to save. Now you just have to get them with wget:

wget -i topics.txt -w 1 -E

 

In the end, wget tells you how many files were downloaded. You can check that all files are OK with this command, which should give the same number (provided you first moved the index files from step 2 somewhere else, otherwise they get counted too):

grep -l '<html' * | wc -l
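If the two numbers differ, something like this should list the pages that came down incomplete (a sketch: -L needs GNU grep, and the glob assumes wget's viewtopic.php... file names):

grep -L '<html' viewtopic.php*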

 

I hope this helps.

 

Yves.

 

Edit: I added in bold some code I previously forgot to type. :oops:



EXCELLENT!!!

I've enjoyed this post a lot.

 

OK, here is what I do to simplify the work you proposed a bit (a single command, with no preliminary steps) ;)

Once the search pages are downloaded, as theYinYeti said, do:

[arusabal@localhost ~]$ URL="www.club-nihil.net/mub/"

[arusabal@localhost ~]$ sed -n "/.*\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {s//$URL\1 => \2/p;}" < 1search.php.html

The output is:

www.club-nihil.net/mub/viewtopic.php?t=7508 => How can one make VIM work/behave like old-fashioned vi?

www.club-nihil.net/mub/viewtopic.php?t=7424 => THE BOARD IS DEATH!!!

www.club-nihil.net/mub/viewtopic.php?t=7147 => THE BOARD WILL NEVER GO DOWN!!!!

www.club-nihil.net/mub/viewtopic.php?t=7457 => weird raknk????!!!

www.club-nihil.net/mub/viewtopic.php?t=687 => Mod Log

www.club-nihil.net/mub/viewtopic.php?t=7439 => HELP! i think either CPU / MEM Died!!

www.club-nihil.net/mub/viewtopic.php?t=7070 => lying low, sort of

www.club-nihil.net/mub/viewtopic.php?t=7438 => Please help a total newbie!

www.club-nihil.net/mub/viewtopic.php?t=65 => Some Mandrake tutorials for newbies

...

...



[arusabal@localhost ~]$

 

As you can see, I've beautified the output a bit, just to show which thread name goes with each URL.

 

Redirect the output to a file, for example topics.txt (the same file name theYinYeti used), and do it for each of the search.php.html pages you saved! (Remember to use '>>' so each run appends to what was already saved; see the small loop below.)
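For example, a small loop along these lines (just a sketch, reusing the exact same sed expression) processes every saved index page in one go:

URL="www.club-nihil.net/mub/"
rm -f topics.txt     # start from an empty list
for file in *search.php.html; do
    # one pass of the sed expression above per saved index page, appended with '>>'
    sed -n "/.*\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {s//$URL\1 => \2/p;}" < "$file" >> topics.txt
done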

 

Now you have a list of all the threads you ever posted in.

 

The result is a text file with the URLs of all pages to save. Now you just have to get them with wget:

wget -i topics.txt -w 1 -E

Instead of doing it that way, since my output also contains the thread names, do this:

[arusabal@localhost ~]$ while read line; do
> url=${line%% *}    # the url is everything up to the first blank space
> wget -w 1 -E ${url}
> done < topics.txt

I hope this helps.

 

Yves.

 

me too!

 

I'll try to download my posts tonight

 

Thanks Yves, I've had a lot of fun and I have learned a couple of things :D


The only weird thing is that I cannot find a way to download all the posts within a topic; it seems the PHP variable posts_per_page is set to 15, and I cannot figure out how to tell the URL that I want ALL the posts!!! :evil:

 

I'm too tired now to think clearly.

 

Well, at least I've saved *most* of my posts (402 threads) :P


I think you guys have managed to slashdot club-nihil :)
:lol:
posts_per_page is set to 15, and I cannot figure out how to tell the URL that I want ALL the posts!!!  :evil:
See below.
EXCELLENT!!! I've enjoyed this post a lot.
Thanks to all of you :) I'm always glad to receive some feedback!
The output is:

www.club-nihil.net/mub/viewtopic.php?t=7508 => How can one make VIM work/behave like old-fashioned vi?
www.club-nihil.net/mub/viewtopic.php?t=7424 => THE BOARD IS DEATH!!!
www.club-nihil.net/mub/viewtopic.php?t=7147 => THE BOARD WILL NEVER GO DOWN!!!!
...

Yes, that's one thing I don't like about my solution: the topic title is not visible. But I decided I did not care, because I'll search those posts with grep anyway, much the same way I do with the search link.

The problem with your solution, however, is that only the 't' parameter of each URL is kept. If you want to have all posts of each thread, then you have to replace all the 'highlight=' with 'start=0', then keep all the '...start=...' URLs, and then delete duplicates. That's the only way I found to have everything: one page for 0-15, another for 16-30, ...
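Concretely, for a thread spanning several pages (like topic 65 in the listing further down), topics.txt ends up with one URL per 15-post page:

http://www.club-nihil.net/mub/viewtopic.php?t=65&start=0
http://www.club-nihil.net/mub/viewtopic.php?t=65&start=15
http://www.club-nihil.net/mub/viewtopic.php?t=65&start=30
http://www.club-nihil.net/mub/viewtopic.php?t=65&start=45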

 

By the way, I made some typing errors when writing this topic. I edited my first post and corrected them in bold.

 

Yves.


Yes, you are right, so after learning a bit more about sed and following your advice, this is what I ended up doing:

 

URL="www.club-nihil.net/mub/"

for file in *search.php.html; do
  echo -e "$(
  sed -n '/.*\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {
       s//\ntag\n# \2: tag\n\t'${URL}'\1 tag\n/;
       s/\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\)&amp;\(start=[[:alnum:]]\{1,3\}\)/\n\t'${URL}'\1\&\2 tag\n/g;
       p;
       }' < ${file}
  )" | grep tag | grep -v 'start=0' | sed 's/tag//g'
done > TOPICS.TXT

(the "tag" thing is just a trick ;) )

 

Now the output is:

...



# HELP! i think either CPU / MEM Died!!:

       www.club-nihil.net/mub/viewtopic.php?t=7439



# lying low, sort of:

       www.club-nihil.net/mub/viewtopic.php?t=7070



# Please help a total newbie!:

       www.club-nihil.net/mub/viewtopic.php?t=7438



# Some Mandrake tutorials for newbies:

       www.club-nihil.net/mub/viewtopic.php?t=65

       www.club-nihil.net/mub/viewtopic.php?t=65&start=15

       www.club-nihil.net/mub/viewtopic.php?t=65&start=30

       www.club-nihil.net/mub/viewtopic.php?t=65&start=45



# deno's point of view - again - about us (IMHO both boards):

       www.club-nihil.net/mub/viewtopic.php?t=7418



# [b]It's a nightmare[/b]; will I ever wake up? Can you help?:

       www.club-nihil.net/mub/viewtopic.php?t=7159

       www.club-nihil.net/mub/viewtopic.php?t=7159&start=15



# Paketmanagement without X:

       www.club-nihil.net/mub/viewtopic.php?t=7395



...

 

This is trivial to parse into a decent HTML index page.
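For instance, something as simple as this already gives a browsable index (only a sketch: index.html and the markup are arbitrary, and it assumes the exact TOPICS.TXT layout shown above):

{
  echo '<html><body><h1>Saved club-nihil topics</h1>'
  while read line; do
      case "$line" in
          '#'*)  echo "<h2>${line#"# "}</h2>" ;;                      # a "# Title:" line becomes a heading
          www.*) echo "<p><a href=\"http://$line\">$line</a></p>" ;;  # an indented URL line becomes a link
      esac
  done < TOPICS.TXT
  echo '</body></html>'
} > index.html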

 

To retrieve the files, do:

 [arusabal@localhost ~]$ while read url; do

> if echo $url | grep www.club-nihil.net &> /dev/null;

> then wget -E $url; fi;

> done < TOPICS.TXT

 

 

I noticed that the problem is still not *completely* solved, because very long threads are linked as:

 

Goto page: 1 ... n-2, n-1, n

 

I know I can write a script to handle this, but who cares... I don't remember any *very-long* thread I've ever posted in that deserves such effort... and if there is one, it will be faster to do it by hand! :)
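For anyone who does care, a rough sketch of what I mean (GNU grep's -o and the default 15 posts per page assumed; MISSING.TXT is a made-up name): take the highest start= value listed for each topic, print one URL per page up to it, then feed that to wget:

grep -o 'viewtopic\.php?t=[0-9]*&start=[0-9]*' TOPICS.TXT \
  | awk -F'[=&]' '{ if ($4 > max[$2]) max[$2] = $4 }
      END { for (t in max)
                for (s = 0; s <= max[t]; s += 15)
                    printf "www.club-nihil.net/mub/viewtopic.php?t=%s&start=%d\n", t, s }' \
  > MISSING.TXT

wget -w 1 -E -i MISSING.TXT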

 

Again, I had a lot of fun (and learned many things) :)


  • 4 weeks later...

Following theYinYeti's advice, all I get after this:

 

grep -h '^href="[^"]*"$' *search.php* | sed -e 's/&amp;/\&/g' -e 's/&highlight=/\&start=0/' | grep -E "viewtopic\.php\?t=[[:digit:]]+&start=" | sort -u | sed -e 's/^href="//' -e 's/"$//' > topics.txt

 

is a blank file named topics.txt. I can't figure out what I'm doing wrong.



don't follow theYinYeti's advice, follow mine :P

 

Just be sure that no blank space is appended at the end of the lines if you copy and paste the commands (otherwise the redirections and pipes won't work right).


Well, I wound up doing it mostly aru's way, but for those of you/us who were moderators over there and want to do this: wget will not retrieve the posts made in the password-protected moderators' forum (which is not even visible to regular users). You probably already knew that, but I just figured it out. They get listed and wget tries to download them, but it cannot retrieve them.


Yes, that is true, and I didn't care when I downloaded my posts; but both 'wget' and 'lynx -source' have options to use Netscape's cookies and thus allow auto-login (I never tried that, but I know the possibility exists).
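For example, something along these lines might work (untested; --load-cookies is a real wget option, but the cookie file location is only a guess at where an old Netscape profile keeps it):

wget --load-cookies ~/.netscape/cookies -i topics.txt -w 1 -E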

 

... but who cares now; anon told me that the old board went down today :(

 

I feel happy that I saved all the posts from the Tips&Tricks and FAQs&Howtos forums yesterday :bitter smile:



 

It sure did. It was working earlier today. I wonder if Tom just vanished off the face of the earth. The first glimpse of a message you get when it redirects you is "Site suspended..."

