
How to set topN via the nutch crawl script

I am trying to crawl a webpage whose URL is http://def.com/xyz/ (say), which has more than 2000 outgoing URLs. However, when I query Solr it shows fewer than 50 documents, whereas I am expecting around 2000. I am using the following command:

./crawl urls TestCrawl http://localhost:8983/solr/ -depth 2 -topN 3000          

The console output is:

Injector: starting at 2014-12-08 21:36:15
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 1
Injector: Total new urls injected: 0
Injector: finished at 2014-12-08 21:36:18, elapsed: 00:00:02

I am assuming that somehow Nutch is not able to pick up the topN value from the crawl script.

Please verify the property db.max.outlinks.per.page in the Nutch configuration. Change this value to a higher number, or to -1, to have all the URLs crawled and indexed.
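As a minimal sketch, assuming a standard Nutch 1.x setup, the override could look like this in conf/nutch-site.xml (the property ships in nutch-default.xml with a default of 100; -1 disables the per-page outlink limit):

<property>
  <!-- Maximum number of outlinks processed per page; -1 means no limit. -->
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>

After changing the value, re-run the crawl so the parse/updatedb steps pick up the new limit.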

Hope this helps,

Le Quoc Do
