I'm making a simple Wikipedia page crawler and writing the details to a remote server running Redis.
1. The crawler asks the server for a page that needs crawling.
2. The crawler loads the page and adds the pages that are found to an internal buffer.
3. When the page has finished being parsed, the results are sent to the server.
How do I do the following:
Keep all pages found on the server, with a flag that states whether the page has been crawled or not.
eg
My question is:
How can I ask Redis to give me the first link it has with a state of 0 (not crawled yet), and then how can I tell Redis to change that state to 1 (after I have crawled it)?
You can use a list to hold the pages that still need to be processed:
RPUSH mylist "http:// ...."
Then you can use LPOP to take the first item off the list:
LPOP mylist
To keep track of the pages that have been processed, you can use a set:
SADD myset "http://....."
Finally, check whether an address is already in the processed set:
SISMEMBER myset "http://...."
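Here is a rough Python sketch of how the four commands could fit together in the crawler. It is only an illustration under some assumptions: the helper names (enqueue, next_uncrawled, mark_crawled) are made up for this example, and FakeRedis is a tiny in-memory stand-in so the snippet runs without a server. With the redis-py package, a real client exposes methods with the same names (rpush, lpop, sadd, sismember), so it should be possible to pass one in instead.

```python
from collections import deque


class FakeRedis:
    """Hypothetical in-memory stand-in that mimics only the four commands used here."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def rpush(self, key, *values):
        # Append to the tail of the list, like RPUSH.
        self.lists.setdefault(key, deque()).extend(values)
        return len(self.lists[key])

    def lpop(self, key):
        # Remove and return the head of the list, or None if it is empty.
        queue = self.lists.get(key)
        return queue.popleft() if queue else None

    def sadd(self, key, *values):
        # Add members to the set and return how many were new.
        members = self.sets.setdefault(key, set())
        added = len(set(values) - members)
        members.update(values)
        return added

    def sismember(self, key, value):
        return value in self.sets.get(key, set())


QUEUE_KEY = "mylist"  # pages waiting to be crawled (state 0)
DONE_KEY = "myset"    # pages already crawled (state 1)


def enqueue(client, url):
    """Queue a discovered page for crawling."""
    client.rpush(QUEUE_KEY, url)


def next_uncrawled(client):
    """Pop URLs until one is found that is not in the processed set."""
    while True:
        url = client.lpop(QUEUE_KEY)
        if url is None:
            return None  # queue is empty
        if not client.sismember(DONE_KEY, url):
            return url


def mark_crawled(client, url):
    """Record that a page has been crawled."""
    client.sadd(DONE_KEY, url)


# With a real server this could be (assumption, not tested here):
#   import redis
#   client = redis.Redis(host="localhost", port=6379, decode_responses=True)
client = FakeRedis()
enqueue(client, "http://example.org/A")
page = next_uncrawled(client)
mark_crawled(client, page)
```

Because LPOP removes the item as it returns it, two crawlers asking at the same time will not be handed the same list entry. The same URL can still be queued more than once, which is why next_uncrawled checks the set before returning a page.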