I have just installed Nutch integrated with Solr and started crawling, but Nutch is not immediately crawling the URLs I specify in seed.txt. Instead, it injects old URLs that I gave earlier but have since commented out. It looks like Nutch is injecting URLs in some strange order. What is the reason? Also, could anybody recommend a book or detailed tutorial on Nutch? Most of the tutorials available cover only installation.
As mentioned in an answer to a similar question, the old URLs are still in Nutch's crawldb.
You can nuke your previous runs completely, as one user did, by deleting the crawl directory and starting fresh, or you can remove the unwanted URLs via CrawlDbMerger:

- CLI: `bin/nutch mergedb` (the command-line front end for CrawlDbMerger, which supports a `-filter` option to drop URLs rejected by your URL filters)
- CLI: `bin/nutch updatedb`
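A minimal sketch of both approaches, assuming a local crawl directory named `crawl/` and a URL-filter rule that excludes the stale URLs (the directory names are illustrative, not Nutch defaults):

```shell
# Option 1: start completely fresh -- delete the old crawl data,
# so the next inject only sees the current seed.txt.
rm -rf crawl/crawldb crawl/linkdb crawl/segments
bin/nutch inject crawl/crawldb urls/   # urls/ contains seed.txt

# Option 2: filter the existing crawldb instead of deleting it.
# First add a rule to conf/regex-urlfilter.txt that rejects the
# unwanted URLs, e.g.:
#   -^http://old\.example\.com/
# Then run mergedb with -filter, which applies the URL filters
# while writing a new, cleaned crawldb.
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
mv crawl/crawldb crawl/crawldb_old
mv crawl/crawldb_filtered crawl/crawldb
```

With option 2 the rest of the crawl history (fetch times, scores) for the URLs you keep is preserved, whereas option 1 discards everything.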