Nutch didn't crawl all URLs from the seed.txt

I am new to Nutch and Solr. Currently I would like to crawl a website whose content is generated by ASP. Since the content is not static, I created a seed.txt which contains all the URLs I would like to crawl. For example:

http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...

The regex-urlfilter.txt has this filter:

# accept anything else
#+.
# note: the dot before "com" is escaped so it matches a literal "." rather than any character
+^http://([a-z0-9]*\.)*abc\.com/
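
To sanity-check a filter like this before a full crawl, the stock Nutch 1.x distribution ships a URL filter checker that reads URLs from stdin and prints "+" for accepted and "-" for rejected URLs. A minimal sketch (the class name and the -allCombined flag assume a Nutch 1.x install; the URL is one of the seeds above):

echo "http://us.abc.com/product/10001" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined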

I used this command to start the crawl:

bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10

The seed.txt contains 40,000+ URLs. However, I found that the content of many of the URLs cannot be found in Solr.

Questions:

  1. Is this approach workable for a large seed.txt?

  2. How can I check whether a URL was crawled?

  3. Does seed.txt have a size limit?

Thank you!

Check out the property db.max.outlinks.per.page in the Nutch configuration files.
The default value for this property is 100, so only 100 URLs will be picked up from the seed.txt and the rest will be skipped.
Change this value to a higher number to have all the URLs scanned and indexed.
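
For example, the override could go in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal sketch: the value 50000 is an arbitrary example chosen to exceed the 40,000+ seed URLs, and a negative value such as -1 lifts the limit entirely:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>50000</value>
  <description>Maximum number of outlinks processed per page; a negative value means no limit.</description>
</property>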

topN indicates how many of the generated links should be fetched in each round. You could have 100 links that have been generated, but if you set topN to 12, then only 12 of those links will be fetched, parsed, and indexed.
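
Putting both together: a sketch of a re-run with the limits raised, followed by two crawldb queries that answer question 2. bin/nutch readdb with -stats and -url is part of standard Nutch 1.x; the paths mirror those used in the question, and the topN value is illustrative only:

bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 50000

# overall counts per status (db_fetched, db_unfetched, ...)
bin/nutch readdb crawl/crawldb -stats

# status of one specific URL; db_fetched means it was crawled
bin/nutch readdb crawl/crawldb -url http://us.abc.com/product/10001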
