
Nutch didn't crawl all URLs from the seed.txt

I am new to Nutch and Solr. I would like to crawl a website whose content is generated by ASP. Since the content is not static, I created a seed.txt which contains all the URLs I would like to crawl. For example:

http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...

The regex-urlfilter.txt has this filter:

# accept anything else
#+.
+^http://([a-z0-9]*\.)*abc.com/

I used this command to start the crawl:

/bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10

The seed.txt contains 40,000+ URLs. However, I found that the content of many of these URLs cannot be found in Solr.

Questions:

  1. Is this approach workable for a large seed.txt?

  2. How can I check whether a URL was crawled?

  3. Does seed.txt have a size limit?

Thank you!

Check out the property db.max.outlinks.per.page in the Nutch configuration files.
The default value for this property is 100, so only 100 URLs will be picked up from seed.txt and the rest will be skipped.
Change this value to a higher number to have all the URLs scanned and indexed.
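If that property is not already overridden, it can be set in conf/nutch-site.xml. This is a minimal sketch based on the property name above; a negative value such as -1 is treated as "no limit", but verify against your own nutch-default.xml:

<!-- override db.max.outlinks.per.page (default 100); a negative value processes all outlinks -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>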

topN indicates how many of the generated links should be fetched in each round. You could have 100 links which have been generated, but if you set topN to 12, then only 12 of those links will get fetched, parsed and indexed.
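With 40,000+ seed URLs, the -topN 10 in the command above also caps each round at 10 URLs, so it needs to be raised as well. A sketch based on the original command (the value 50000 is just an example):

bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 50000

To check whether a particular URL actually made it into the crawl (question 2), the crawldb can be inspected with the readdb tool. This sketch assumes Nutch 1.x and the crawl directory used above:

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -url http://us.abc.com/product/10001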
