
How to update the fetch status in the crawldb in Apache Nutch?

I did a web crawl using Apache Nutch and ran two fetch rounds. This produced a crawldb containing 21 URLs with status 'fetched' and 537 URLs with status 'unfetched'. I want to mark all the links in the crawldb as fetched. Is there any way to update the status?

I found the answer to my question and wanted to share it with you all. After fetching two rounds I had updated the db with the command 'bin/nutch updatedb crawl/crawldb $s2'. That adds the newly discovered URLs to the db with status 'unfetched'. But if you run 'bin/nutch updatedb crawl/crawldb $s2 -noAdditions', it does not add new URLs to the db and only updates the URLs already in it, marking the ones fetched in that segment as 'fetched'.
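For reference, here is a minimal bash sketch of that sequence. It assumes the directory layout from the Nutch tutorial (crawl/crawldb and crawl/segments) and that it is run from the Nutch runtime directory. The function names and the DRY_RUN switch are hypothetical helpers of mine, not part of Nutch. By default it only prints the commands, so nothing is changed until DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only. CRAWLDB is an assumed path (Nutch tutorial layout);
# override it through the environment if your crawl lives elsewhere.
CRAWLDB="${CRAWLDB:-crawl/crawldb}"

# Hypothetical helper: print the command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Update the crawldb from one segment without adding newly discovered URLs.
update_without_additions() {
  run bin/nutch updatedb "$CRAWLDB" "$1" -noAdditions
}

# Print the per-status URL counts so the result can be checked.
show_stats() {
  run bin/nutch readdb "$CRAWLDB" -stats
}
```

With the functions loaded, and with $s2 holding the path of the most recent segment as in the commands above, the update would be applied with 'DRY_RUN=0 update_without_additions "$s2"' and then checked with 'DRY_RUN=0 show_stats'. The stats output should list how many URLs are in each status.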

