
Is there any way to log the list of URLs 'ignored' in a Nutch crawl?

I am using Nutch to crawl a list of URLs specified in the seed file, with depth 100 and topN 10,000 to ensure a full crawl. Also, I am trying to ignore URLs with repeated strings in their path using a regex-urlfilter: http://rubular.com/r/oSkwqGHrri
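For reference, a rule of this kind would go in conf/regex-urlfilter.txt. The pattern below is the loop-breaking rule shipped in Nutch's default regex-urlfilter.txt, which rejects URLs whose path contains a slash-delimited segment repeated three or more times; the exact pattern behind the rubular link above may differ.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/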

However, I am curious to know which URLs have been ignored during crawling. Is there any way I can log the list of URLs "ignored" by Nutch while it crawls?

The links can be found by using the following command:

bin/nutch readdb PATH_TO_CRAWL_DB -stats -sort -dump DUMP_FOLDER -format csv

This will generate a part-00000 file in DUMP_FOLDER containing the list of URLs and the status of each.

Those with the status db_unfetched have been ignored by the crawler.
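As a quick sketch (assuming the CSV dump includes the status name on each row), the ignored URLs can then be pulled out of the dump with a simple grep:

grep "db_unfetched" DUMP_FOLDER/part-00000 > ignored_urls.csv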
