
How to crawl 1 million documents daily from the web using Apache Nutch 2.3

I have configured Apache Nutch 2.3 with Hadoop 1.2.1 and HBase 0.94.x. I need to crawl the web for a few weeks, and about 1 million documents need to be crawled. I have a four-node Hadoop cluster. Before this configuration, I set up Nutch on a single machine and crawled some documents, but the crawl rate was no more than 50k to 80k documents. How should Nutch be configured so that it can crawl the required number of documents daily?

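In general, you can set a larger topN so that more URLs are selected for fetching in each crawl round, and also set <name>http.content.limit</name> in nutch-site.xml to -1 so that downloaded content is not truncated.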
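As a minimal sketch of the second suggestion, the property could be added to nutch-site.xml roughly as follows. This assumes the standard Hadoop-style `<configuration>` / `<property>` layout that Nutch configuration files use; any properties already in your file would sit alongside this one. Note that topN is normally passed to the generate step of the crawl rather than set in this file, so it is not shown here.

```xml
<configuration>
  <!-- Assumption: other existing properties in your nutch-site.xml stay as they are. -->
  <property>
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of content fetched over HTTP. -->
    <value>-1</value>
  </property>
</configuration>
```

Removing the limit means very large pages are stored in full, so it may be worth watching fetch time and HBase storage after making the change.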

Hope this helps,

Le Quoc Do

