Nutch didn't crawl all URLs from the seed.txt

I am new to Nutch and Solr. Currently I would like to crawl a website whose content is generated by ASP. Since the content is not static, I created a seed.txt which contains all the URLs I would like to crawl. For example:

http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...

The regex-urlfilter.txt has this filter:

# accept anything else
#+.
# note: the dot before "com" is escaped so it matches a literal "." rather than any character
+^http://([a-z0-9]*\.)*abc\.com/
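
To sanity-check a filter like this before a full crawl, the stock Nutch 1.x distribution ships a URL filter checker that reads URLs from stdin and prints "+" for accepted and "-" for rejected URLs. A minimal sketch (the class name and the -allCombined flag assume a Nutch 1.x install; the URL is one of the seeds above):

echo "http://us.abc.com/product/10001" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined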

I used this command to start the crawl:

bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10

The seed.txt contains 40,000+ URLs. However, I found that the content of many of the URLs cannot be found in Solr.

Questions:

  1. Is this approach workable for a large seed.txt?

  2. How can I check whether a URL was crawled?

  3. Does seed.txt have a size limit?

Thank you!

Check out the property db.max.outlinks.per.page in the Nutch configuration files.
The default value for this property is 100, so only 100 URLs will be picked up from the seed.txt and the rest will be skipped.
Change this value to a higher number to have all the URLs scanned and indexed.
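
For example, the override could go in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal sketch: the value 50000 is an arbitrary example chosen to exceed the 40,000+ seed URLs, and a negative value such as -1 lifts the limit entirely:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>50000</value>
  <description>Maximum number of outlinks processed per page; a negative value means no limit.</description>
</property>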

topN indicates how many of the generated links should be fetched in each round. You could have 100 links that have been generated, but if you set topN to 12, then only 12 of those links will be fetched, parsed, and indexed.
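
Putting both together: a sketch of a re-run with the limits raised, followed by two crawldb queries that answer question 2. bin/nutch readdb with -stats and -url is part of standard Nutch 1.x; the paths mirror those used in the question, and the topN value is illustrative only:

bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 50000

# overall counts per status (db_fetched, db_unfetched, ...)
bin/nutch readdb crawl/crawldb -stats

# status of one specific URL; db_fetched means it was crawled
bin/nutch readdb crawl/crawldb -url http://us.abc.com/product/10001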
