
How to crawl 1 million documents daily from the web using Apache Nutch 2.3

Original post: 2015-12-01 06:42:25. Tags: hadoop, web-scraping, web-crawler, hbase, nutch

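I have configured Apache Nutch 2.3 with Hadoop 1.2.1 and HBase 0.94.x. I have to crawl the web for a few weeks, and about 1 million documents need to be retrieved. I have a four-node Hadoop cluster. Before this setup, I ran Nutch on a single machine and crawled some documents, but the crawl rate did not exceed 50k to 80k. How should Nutch be configured so that it can crawl the required number of documents daily?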

1 Answer

In general, you can set a larger TopN, and you can also change <name>http.content.limit</name> to -1 in nutch-site.xml.

Hope this helps,

Le Quoc Do
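
To illustrate the second suggestion in the answer above, here is a minimal sketch of what the nutch-site.xml override might look like. The property name and the value -1 come from the answer; the surrounding <configuration> and <property> wrapper is the usual Hadoop-style layout and is assumed here, and the comments are illustrative only. Note that this setting stops fetched content from being truncated; on its own it is unlikely to raise the number of documents fetched per day.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch; merge into your existing file rather than replacing it) -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of downloaded content -->
    <value>-1</value>
  </property>
</configuration>
```

TopN, by contrast, is normally supplied to the generate step of each crawl cycle rather than set in this file, so the exact way to raise it depends on how the crawl is launched in your Nutch version.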

Related questions:

How to Crawl .pdf links using Apache Nutch

Apache Nutch in distributed mode not going to crawl from web

Apache Nutch restart crawl

How to customize Apache Nutch 2.3 generate step

How to restrict Apache Nutch 2.3.1 to crawl story content and not side bars

org.apache.nutch.crawl.Crawl missing in Nutch 1.9 on Hadoop 1.2.1

How to include previously excluded URLs in a Nutch crawl

Apache Nutch 1.9 on Hadoop 1.2.1 no Crawl class in jar file

Error while Integrating Apache Nutch 2.3 with Hbase 0.94.14 and Solr 5.2.1

Apache Nutch 2.3: throwing Error Failed with exit value 255
