Using Redis NoSQL with a web crawler

Question

I'm making a simple Wikipedia page crawler and writing the details to a remote server running Redis.

 1 The crawler asks the server for a page that needs crawling
 2 The crawler loads the page and adds the pages that are found to an internal buffer
 3 When the page has finished being parsed the results are sent to the server

How do I do the following:

Keep all pages found on the server, with a flag that states whether each page has been crawled or not.

eg

My question is:

How can I ask Redis to give me the first link it has with a state of 0 (not crawled yet), and then how can I tell Redis to change that state to 1 (after I have crawled it)?

Answer 1

You can use a list to hold the pages that still need processing:

RPUSH mylist "http:// ...."

Then you can use LPOP to get (and remove) the first item in the list:

LPOP mylist

To keep track of processed pages, you can use a set:

SADD myset "http://....."

Finally, check whether an address is already in the processed set:

SISMEMBER myset "http://...."
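
The four commands above can be combined into a small queue-plus-set workflow. The following is only a sketch under assumptions: the helper names (enqueue, next_page, mark_crawled) and the InMemoryClient class are hypothetical, and the in-memory class merely imitates RPUSH, LPOP, SADD and SISMEMBER so the example runs without a server. The key names mylist and myset are taken from the commands above.

```python
from collections import deque

QUEUE_KEY = "mylist"  # pages waiting to be crawled
DONE_KEY = "myset"    # pages that have already been crawled


class InMemoryClient:
    """Hypothetical stand-in imitating the four Redis commands used here."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def rpush(self, key, value):
        # Append to the tail of the list, like RPUSH.
        self.lists.setdefault(key, deque()).append(value)
        return len(self.lists[key])

    def lpop(self, key):
        # Remove and return the head of the list, or None if it is empty.
        items = self.lists.get(key)
        return items.popleft() if items else None

    def sadd(self, key, value):
        # Return 1 if the member was newly added, 0 if it already existed.
        members = self.sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1

    def sismember(self, key, value):
        return value in self.sets.get(key, set())


def enqueue(client, url):
    """Queue a page unless it has already been crawled."""
    if not client.sismember(DONE_KEY, url):
        client.rpush(QUEUE_KEY, url)


def next_page(client):
    """Pop pages until one is found that is not crawled yet (None if empty)."""
    while True:
        url = client.lpop(QUEUE_KEY)
        if url is None or not client.sismember(DONE_KEY, url):
            return url


def mark_crawled(client, url):
    """Record the page as crawled (the 'state 1' from the question)."""
    client.sadd(DONE_KEY, url)
```

A real client object exposing the same method names (for example one from the redis-py package, created with decode_responses=True so that strings rather than bytes come back) could probably be passed in place of InMemoryClient, but that is an assumption to verify against your client library. Since LPOP removes the item it returns, two crawlers asking the server at the same time should not receive the same page.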


solution1
4 ACCEPTED 2011-10-06 11:51:26
