I have just installed Nutch integrated with Solr and started crawling, but Nutch is not immediately crawling the URLs I specify in seed.txt. Instead, it injects old URLs that I gave earlier but have since commented out. It looks like Nutch is injecting URLs in some strange order. What is the reason? Also, could anybody recommend a book or detailed tutorial on Nutch? Most of the tutorials available cover only installation.
As mentioned in an answer to a similar question, the old URLs are still in Nutch's crawldb.
You can nuke your previous runs completely, as one user did, by deleting the crawl directory and starting fresh, or you can remove the unwanted URLs via CrawlDbMerger:

- CLI: `bin/nutch mergedb` (the command-line front end for CrawlDbMerger, which supports a `-filter` option to drop URLs rejected by your URL filters)
- CLI: `bin/nutch updatedb`
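A minimal sketch of both approaches, assuming a local crawl directory named `crawl/` and a URL-filter rule that excludes the stale URLs (the directory names are illustrative, not Nutch defaults):

```shell
# Option 1: start completely fresh -- delete the old crawl data,
# so the next inject only sees the current seed.txt.
rm -rf crawl/crawldb crawl/linkdb crawl/segments
bin/nutch inject crawl/crawldb urls/   # urls/ contains seed.txt

# Option 2: filter the existing crawldb instead of deleting it.
# First add a rule to conf/regex-urlfilter.txt that rejects the
# unwanted URLs, e.g.:
#   -^http://old\.example\.com/
# Then run mergedb with -filter, which applies the URL filters
# while writing a new, cleaned crawldb.
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
mv crawl/crawldb crawl/crawldb_old
mv crawl/crawldb_filtered crawl/crawldb
```

With option 2 the rest of the crawl history (fetch times, scores) for the URLs you keep is preserved, whereas option 1 discards everything.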