
Is there any way to log the list of URLs 'ignored' in a Nutch crawl?

I am using Nutch to crawl a list of URLs specified in the seed file, with depth 100 and topN 10,000 to ensure a full crawl. Also, I am trying to ignore URLs with repeated strings in their path using a regex-urlfilter: http://rubular.com/r/oSkwqGHrri
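For reference, a rule of this kind would go in conf/regex-urlfilter.txt. The pattern below is the loop-breaking rule shipped in Nutch's default regex-urlfilter.txt, which rejects URLs whose path contains a slash-delimited segment repeated three or more times; the exact pattern behind the rubular link above may differ.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/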

However, I am curious to know which URLs have been ignored during crawling. Is there any way I can log the list of URLs "ignored" by Nutch while it crawls?

The links can be found by using the following command:

bin/nutch readdb PATH_TO_CRAWL_DB -stats -sort -dump DUMP_FOLDER -format csv

This will generate a part-00000 file in DUMP_FOLDER containing the list of URLs and the status of each.

Those with the status db_unfetched have been ignored by the crawler.
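As a quick sketch (assuming the CSV dump includes the status name on each row), the ignored URLs can then be pulled out of the dump with a simple grep:

grep "db_unfetched" DUMP_FOLDER/part-00000 > ignored_urls.csv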
