
Nutch not crawling url specified in seed.txt

I have just installed Nutch integrated with Solr and started crawling, but Nutch is not immediately crawling the URLs I specify in seed.txt. Instead, it is injecting old URLs that I gave earlier and have since commented out. It looks like Nutch is injecting URLs in some strange order. What is the reason? Also, could anybody point me to a book or detailed tutorial on Nutch? Most of the tutorials available cover only installation.

As mentioned in an answer to a similar question, the old URLs are still in Nutch's crawldb.

You can nuke your previous runs completely like this user did and start fresh, or you can remove the unwanted URLs from the existing crawldb via CrawlDbMerger, as sketched below:
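A minimal sketch of both options, assuming a local Nutch install with crawl data under ./crawl (the paths and the filter pattern are illustrative, not from the original post):

    # Option 1: start fresh by deleting the old crawl data
    # (crawldb, linkdb and segments), then re-inject seed.txt.
    rm -rf crawl/crawldb crawl/linkdb crawl/segments

    # Option 2: filter unwanted URLs out of the existing crawldb.
    # First add exclusion patterns for the old URLs to conf/regex-urlfilter.txt,
    # e.g. a line like:  -^http://old.example.com/
    # Then run mergedb with -filter so the URL filters are applied while copying:
    bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

    # Swap the filtered copy in place of the old crawldb.
    mv crawl/crawldb crawl/crawldb-old
    mv crawl/crawldb-filtered crawl/crawldb

With either approach, the next inject/generate/fetch cycle should pick up only the URLs currently listed in seed.txt (plus whatever the filters still allow).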

