
Using Redis (NoSQL) with a web crawler

I'm making a simple Wikipedia page crawler and writing the details to a remote server running Redis.

 1 The crawler asks the server for a page that needs crawling
 2 The crawler loads the page and adds the pages that are found to an internal buffer
 3 When the page has finished being parsed the results are sent to the server 

How do I do the following:

Keep all pages found on the server, with a flag that states whether each page has been crawled or not.

eg

My question is:

How can I ask Redis to give me the first link it has with a state of 0 (not crawled yet), and then how can I tell Redis to change that state to 1 (after I have crawled it)?

You can use a list to hold the pages that still need to be processed:

RPUSH mylist "http:// ...."

Then you can use LPOP to remove and return the first item in the list:

LPOP mylist

To keep track of processed pages, you can use a set:

SADD myset "http://....."

Finally, check whether the address is already in the processed set:

SISMEMBER myset "http://...."
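
As a rough sketch of how these four commands could fit together in a crawler, here is one possible arrangement in Python. The helper names (enqueue, next_page, mark_crawled), the InMemoryClient class and the example.org URLs are all made up for illustration. InMemoryClient is only a stand-in so the snippet runs without a server; it mimics the rpush, lpop, sadd and sismember methods that the redis-py client provides, so a real redis.Redis(..., decode_responses=True) connection could presumably be passed in its place.

```python
from collections import deque

QUEUE_KEY = "mylist"  # list of pages waiting to be crawled
DONE_KEY = "myset"    # set of pages that have been crawled


class InMemoryClient:
    """Hypothetical stand-in for a Redis client (not part of any library)."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def rpush(self, key, *values):
        # Append to the tail of the list, like RPUSH.
        self.lists.setdefault(key, deque()).extend(values)
        return len(self.lists[key])

    def lpop(self, key):
        # Remove and return the head of the list, like LPOP.
        items = self.lists.get(key)
        return items.popleft() if items else None

    def sadd(self, key, *values):
        # Add members to the set, like SADD; returns how many were new.
        members = self.sets.setdefault(key, set())
        added = len(set(values) - members)
        members.update(values)
        return added

    def sismember(self, key, value):
        # Membership test, like SISMEMBER.
        return value in self.sets.get(key, set())


def enqueue(client, url):
    """Queue a page unless it is already in the processed set."""
    if client.sismember(DONE_KEY, url):
        return False
    client.rpush(QUEUE_KEY, url)
    return True


def next_page(client):
    """Return the oldest queued page, or None if the queue is empty."""
    return client.lpop(QUEUE_KEY)


def mark_crawled(client, url):
    """Record that a page has been crawled."""
    client.sadd(DONE_KEY, url)


client = InMemoryClient()
enqueue(client, "http://example.org/a")  # illustrative URLs only
enqueue(client, "http://example.org/b")
page = next_page(client)     # step 1: ask for a page to crawl
mark_crawled(client, page)   # step 3: report it as done
```

Note that this sketch only skips pages that are already in the processed set; the same address can still be queued twice if it is found again before it has been crawled.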


 