Why does my Apache Nutch warc and commoncrawldump fail after crawl?

I have successfully crawled a website using Nutch and now I want to create a WARC from the results. However, running both the warc and commoncrawldump commands fails. Running bin/nutch dump -segment .... on the same segment folder works successfully.

I am using Nutch 1.17 and running:

bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments
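
and, for the WARC export, something along these lines (a sketch of how I was invoking it; <segment_name> is a placeholder for one of the timestamped segment directories, and you should check the usage printed by bin/nutch warc with no arguments for the exact form in your version):

bin/nutch warc output/warc crawl/segments/<segment_name>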

The error in hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/, despite having just run a crawl there.

Inside the segments folder were segments from a previous crawl that were throwing the error. They did not contain all of the segment data, as I believe that crawl was cancelled/finished early. This caused the entire process to fail. Deleting all of those files and starting anew fixed the issue.
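
If anyone hits the same problem, a quick way to spot incomplete segments before running commoncrawldump or warc is to check that each segment directory contains the usual subdirectories. This is only a rough shell sketch (crawl/segments is the path from my setup, and the subdirectory list assumes a normal fetch with parsing enabled):

for seg in crawl/segments/*/; do
  # a fully written segment normally has all six of these subdirectories
  for sub in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "incomplete: $seg (missing $sub)"
  done
done

Any segment it reports as incomplete can be deleted (or re-crawled) before running the dump.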
