How to crawl 1 million documents daily from web using apache Nutch 2.3

Original post 2015-12-01 06:42:25 8 1 hadoop/ web-scraping/ web-crawler/ hbase/ nutch

I have configured Apache Nutch 2.3 with Hadoop 1.2.1 and HBase 0.94.x on a four-node Hadoop cluster. I have to crawl the web for a few weeks, and about 1 million documents need to be crawled. Before this configuration, I set up Nutch on a single machine and crawled some documents, but the crawl rate did not exceed 50k to 80k documents. How should Nutch be configured so that it can crawl the required number of documents daily?

1 answer

In general, you can set a larger TopN and also change the value of <name>http.content.limit</name> in nutch-site.xml to -1, which disables truncation of fetched content.
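As a rough sketch of the second suggestion, the property could be overridden in nutch-site.xml as shown here. Only http.content.limit comes from the answer; the surrounding file layout is the standard Hadoop-style configuration format and is assumed to match your installation. TopN is normally not set in this file; it is usually passed to the generate step as an argument (for example -topN on the command line), so it is not shown.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- A negative value means downloaded HTTP content is not truncated -->
    <value>-1</value>
  </property>
</configuration>
```

With truncation disabled, large documents are downloaded in full, so it may be worth watching fetch time and storage on the cluster after making this change.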

Hope this helps,

Le Quoc Do





Answer 1 (posted 2016-03-11 19:37:08)