HowTo retrieve your messages from club-nihil before it dies!


theYinYeti

Here's how I almost automatically retrieved all my messages from the old boards, at club-nihil.net:

Connect to the club-nihil site, and follow the "My posts" link.

 

Save the index(es) of your posts in as many files as needed. For me, that was:

1search.php.html, 2search.php.html, 3search.php.html, 4search.php.html, 5search.php.html, 6search.php.html, 7search.php.html

 

Open those files with a Regular-Expressions-enabled text-editor. For all of those files:

- replace text 'http://www.club-nihil.net/mub/' with '' (nothing)

- replace text 'http://club-nihil.net/mub/' with '' (nothing)

- replace regular expression 'href="([^"]*)"' with '\nhref="http://www.club-nihil.net/mub/\1"\n'

I suppose you can do all that with `sed` instead of an editor, but I'm not sure every sed understands \n for newlines in the replacement text... (see the sketch after the next step)

 

Save all changes. You'll obtain files containing only absolute URLs, with each href on its own line.
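If you prefer to skip the editor, roughly the same preprocessing can be done non-interactively. This is only a sketch, assuming GNU sed (for in-place editing with -i and for \n in the replacement) and that your saved index files match *search.php*:

# strip the absolute prefixes, then put every href="..." on its own line
# with the absolute URL restored (GNU sed assumed)
sed -i -e 's|http://www\.club-nihil\.net/mub/||g' \
       -e 's|http://club-nihil\.net/mub/||g' \
       -e 's|href="\([^"]*\)"|\nhref="http://www.club-nihil.net/mub/\1"\n|g' \
       *search.php*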

Then, in a shell, run this:

grep -h '^href="[^"]*"$' *search.php* | sed -e 's/&amp;/\&/g' -e 's/&highlight=/\&start=0/' | grep -E "viewtopic\.php\?t=[[:digit:]]+&start=" | sort -u | sed -e 's/^href="//' -e 's/"$//' > topics.txt

(only one line!)
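To make the pipeline easier to follow, here is roughly what one line looks like at each stage, using topic 7508 from the example output further down (the exact shape of the hrefs is an assumption based on phpBB's usual search links):

href="http://www.club-nihil.net/mub/viewtopic.php?t=7508&amp;highlight="   <- as produced by the editor step
href="http://www.club-nihil.net/mub/viewtopic.php?t=7508&start=0"          <- after the two sed expressions
http://www.club-nihil.net/mub/viewtopic.php?t=7508&start=0                 <- final line in topics.txt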

 

The result is a text file with the URLs of all pages to save. Now you just have to get them with wget:

wget -i topics.txt -w 1 -E

 

In the end, wget tells you how many files were downloaded. You can check that all files are OK with this command, which should give the same number (provided you first moved the index files from step 2 somewhere else, otherwise they get counted too):

grep -l '<html' * | wc -l
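If the two numbers differ, something like this should list the pages that came down incomplete (a sketch: -L needs GNU grep, and the glob assumes wget's viewtopic.php... file names):

grep -L '<html' viewtopic.php*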

 

I hope this helps.

 

Yves.

 

Edit: I added in bold some code I previously forgot to type. :oops:



EXCELLENT!!!

I've enjoyed this post a lot.

 

OK, here is what I do to simplify the work you proposed a bit (a single command, with no preliminary steps) ;)

Once the search pages are downloaded, as theYinYeti said, do:

[arusabal@localhost ~]$ URL="www.club-nihil.net/mub/"

[arusabal@localhost ~]$ sed -n "/.*\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {s//$URL\1 => \2/p;}" < 1search.php.html

The output is:

www.club-nihil.net/mub/viewtopic.php?t=7508 => How can one make VIM work/behave like old-fashioned vi?

www.club-nihil.net/mub/viewtopic.php?t=7424 => THE BOARD IS DEATH!!!

www.club-nihil.net/mub/viewtopic.php?t=7147 => THE BOARD WILL NEVER GO DOWN!!!!

www.club-nihil.net/mub/viewtopic.php?t=7457 => weird raknk????!!!

www.club-nihil.net/mub/viewtopic.php?t=687 => Mod Log

www.club-nihil.net/mub/viewtopic.php?t=7439 => HELP! i think either CPU / MEM Died!!

www.club-nihil.net/mub/viewtopic.php?t=7070 => lying low, sort of

www.club-nihil.net/mub/viewtopic.php?t=7438 => Please help a total newbie!

www.club-nihil.net/mub/viewtopic.php?t=65 => Some Mandrake tutorials for newbies

...

...



[arusabal@localhost ~]$

 

As you can see, I've beautified the output a bit, just to show which thread name goes with each URL.

 

Redirect the output to a file, for example topics.txt (the same file name theYinYeti used), and do it for each of the search.php.html pages you saved! (Remember to use '>>' so each run appends to what was already saved; see the small loop below.)
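For example, a small loop along these lines (just a sketch, reusing the exact same sed expression) processes every saved index page in one go:

URL="www.club-nihil.net/mub/"
rm -f topics.txt     # start from an empty list
for file in *search.php.html; do
    # one pass of the sed expression above per saved index page, appended with '>>'
    sed -n "/.*\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {s//$URL\1 => \2/p;}" < "$file" >> topics.txt
done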

 

Now you have a list of all the threads you ever posted in.

 

The result is a text file with the URLs of all pages to save. Now you just have to get them with wget:

wget -i topics.txt -w 1 -E

Instead of doing it that way, since my output also contains the thread names, do this:

[arusabal@localhost ~]$ while read line; do
> url=${line%% *}    # the url is everything up to the first blank space
> wget -w 1 -E ${url}
> done < topics.txt

I hope this helps.

 

Yves.

 

me too!

 

I'll try to download my posts tonight

 

Thanks Yves, I've had a lot of fun and I have learned a couple of things :D


The only weird thing is that I cannot find a way to download all the posts within a topic; it seems the PHP variable posts_per_page is set to 15, and I cannot figure out how to tell the URL that I want ALL the posts!!! :evil:

 

I'm too tired now to think clearly.

 

Well, at least I've saved *most* of my posts (402 threads) :P


I think you guys have managed to slashdot club-nihil :)
:lol:
posts_per_page is set to 15, and I cannot figure out how to tell the URL that I want ALL the posts!!!  :evil:
See below.
EXCELLENT!!! I've enjoyed this post a lot.
Thanks to all of you :) I'm always glad to receive some feedback!
The output is:

www.club-nihil.net/mub/viewtopic.php?t=7508 => How can one make VIM work/behave like old-fashioned vi?
www.club-nihil.net/mub/viewtopic.php?t=7424 => THE BOARD IS DEATH!!!
www.club-nihil.net/mub/viewtopic.php?t=7147 => THE BOARD WILL NEVER GO DOWN!!!!
...

Yes, that's one thing I don't like about my solution: the topic title is not visible. But I decided I did not care, because I'll search those posts with grep anyway, much the same way I do with the search link.

The problem with your solution, however, is that only the 't' parameter of each URL is kept. If you want to have all posts of each thread, then you have to replace all the 'highlight=' with 'start=0', then keep all the '...start=...' URLs, and then delete duplicates. That's the only way I found to have everything: one page for 0-15, another for 16-30, ...
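Concretely, for a thread spanning several pages (like topic 65 in the listing further down), topics.txt ends up with one URL per 15-post page:

http://www.club-nihil.net/mub/viewtopic.php?t=65&start=0
http://www.club-nihil.net/mub/viewtopic.php?t=65&start=15
http://www.club-nihil.net/mub/viewtopic.php?t=65&start=30
http://www.club-nihil.net/mub/viewtopic.php?t=65&start=45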

 

By the way, I made some typing errors when writing this topic. I edited my first post and corrected them in bold.

 

Yves.


Yes, you are right, so after learning a bit more about sed and following your advice, this is what I ended up doing:

 

URL="www.club-nihil.net/mub/"

for file in *search.php.html; do
  echo -e "$(
  sed -n '/.*\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\).*topictitle..\(.*\)<\/a><.*/ {
       s//\ntag\n# \2: tag\n\t'${URL}'\1 tag\n/;
       s/\(viewtopic\.php?t=[[:alnum:]]\{1,5\}\)&amp;\(start=[[:alnum:]]\{1,3\}\)/\n\t'${URL}'\1\&\2 tag\n/g;
       p;
       }' < ${file}
  )" | grep tag | grep -v 'start=0' | sed 's/tag//g'
done > TOPICS.TXT

(the "tag" thing is just a trick ;) )

 

Now the output is:

...



# HELP! i think either CPU / MEM Died!!:

       www.club-nihil.net/mub/viewtopic.php?t=7439



# lying low, sort of:

       www.club-nihil.net/mub/viewtopic.php?t=7070



# Please help a total newbie!:

       www.club-nihil.net/mub/viewtopic.php?t=7438



# Some Mandrake tutorials for newbies:

       www.club-nihil.net/mub/viewtopic.php?t=65

       www.club-nihil.net/mub/viewtopic.php?t=65&start=15

       www.club-nihil.net/mub/viewtopic.php?t=65&start=30

       www.club-nihil.net/mub/viewtopic.php?t=65&start=45



# deno's point of view - again - about us (IMHO both boards):

       www.club-nihil.net/mub/viewtopic.php?t=7418



# [b]It's a nightmare[/b]; will I ever wake up? Can you help?:

       www.club-nihil.net/mub/viewtopic.php?t=7159

       www.club-nihil.net/mub/viewtopic.php?t=7159&start=15



# Paketmanagement without X:

       www.club-nihil.net/mub/viewtopic.php?t=7395



...

 

This is trivial to parse into a decent HTML index page.
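For instance, something as simple as this already gives a browsable index (only a sketch: index.html and the markup are arbitrary, and it assumes the exact TOPICS.TXT layout shown above):

{
  echo '<html><body><h1>Saved club-nihil topics</h1>'
  while read line; do
      case "$line" in
          '#'*)  echo "<h2>${line#"# "}</h2>" ;;                      # a "# Title:" line becomes a heading
          www.*) echo "<p><a href=\"http://$line\">$line</a></p>" ;;  # an indented URL line becomes a link
      esac
  done < TOPICS.TXT
  echo '</body></html>'
} > index.html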

 

To retrieve the files, do:

 [arusabal@localhost ~]$ while read url; do

> if echo $url | grep www.club-nihil.net &> /dev/null;

> then wget -E $url; fi;

> done < TOPICS.TXT

 

 

I noticed that the problem is still not *completely* solved, because very long threads are linked as:

 

Goto page: 1 ... n-2, n-1, n

 

I know I can write a script to handle this, but who cares... I don't remember any *very-long* thread I've ever posted in that deserves such effort... and if there is one, it will be faster to do it by hand! :)
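For anyone who does care, a rough sketch of what I mean (GNU grep's -o and the default 15 posts per page assumed; MISSING.TXT is a made-up name): take the highest start= value listed for each topic, print one URL per page up to it, then feed that to wget:

grep -o 'viewtopic\.php?t=[0-9]*&start=[0-9]*' TOPICS.TXT \
  | awk -F'[=&]' '{ if ($4 > max[$2]) max[$2] = $4 }
      END { for (t in max)
                for (s = 0; s <= max[t]; s += 15)
                    printf "www.club-nihil.net/mub/viewtopic.php?t=%s&start=%d\n", t, s }' \
  > MISSING.TXT

wget -w 1 -E -i MISSING.TXT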

 

Again, I had a lot of fun (and learned many things) :)


  • 4 weeks later...

Following theYinYeti's advice, all I get after this:

 

grep -h '^href="[^"]*"$' *search.php* | sed -e 's/&amp;/\&/g' -e 's/&highlight=/\&start=0/' | grep -E "viewtopic\.php\?t=[[:digit:]]+&start=" | sort -u | sed -e 's/^href="//' -e 's/"$//' > topics.txt

 

is a blank file named topics.txt. I can't figure out what I'm doing wrong.



don't follow theYinYeti's advice, follow mine :P

 

Just be sure that no blank space is appended at the end of the lines if you copy and paste the commands (otherwise the redirections and pipes won't work right).


Well, I wound up doing it mostly aru's way, but for those of you/us who were moderators over there and want to do this: wget will not retrieve the posts made in the password-protected moderators' forum (which is not even visible to regular users). You probably already knew that, but I just figured it out. They get listed and wget tries to download them, but it cannot retrieve them.


Yes, that is true, and I didn't care when I downloaded my posts; but both 'wget' and 'lynx -source' have options to use Netscape's cookies and thus allow auto-login (I never tried that, but I know the possibility exists).
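For example, something along these lines might work (untested; --load-cookies is a real wget option, but the cookie file location is only a guess at where an old Netscape profile keeps it):

wget --load-cookies ~/.netscape/cookies -i topics.txt -w 1 -E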

 

... but who cares now; anon told me that the old board went down today :(

 

I feel happy that I saved all the posts from the Tips&Tricks and FAQs&Howtos forums yesterday :bitter smile:



 

It sure did. It was working earlier today. I wonder if Tom just vanished off the face of the earth. The first glimpse of a message you get when it redirects you is "Site suspended..."

