
How to set topN via the Nutch crawl script

I am trying to crawl a web page at http://def.com/xyz/ (for example) that has more than 2000 outgoing URLs, but when I query Solr it shows fewer than 50 documents, whereas I expect around 2000. I am using the following command:

./crawl urls TestCrawl http://localhost:8983/solr/ -depth 2 -topN 3000          

The console output is:

Injector: starting at 2014-12-08 21:36:15
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 1
Injector: Total new urls injected: 0
Injector: finished at 2014-12-08 21:36:18, elapsed: 00:00:02

I assume that Nutch is somehow not picking up the topN value from the crawl script.

Please check the db.max.outlinks.per.page property in the Nutch configuration. It caps how many outlinks Nutch processes for each page, so raise it to a higher number, or set it to -1, to have all the URLs crawled and indexed.
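As a rough sketch, the override could look like the fragment below. It assumes a standard Nutch 1.x layout where local overrides go in conf/nutch-site.xml; the description text is my own wording, not copied from the shipped defaults, so compare it against nutch-default.xml in your installation.

```xml
<!-- conf/nutch-site.xml (assumed location of local overrides) -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- A negative value removes the per-page cap on processed outlinks -->
    <value>-1</value>
    <description>Maximum number of outlinks processed per page;
    negative means all outlinks are processed.</description>
  </property>
</configuration>
```

If you run from a built runtime directory, the change may need to be made in that runtime's conf directory (or the project rebuilt) before it takes effect, and the pages already fetched would have to be re-parsed or re-crawled for the extra outlinks to show up.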

Hope this helps,

Le Quoc Do


 