Why does my Apache Nutch warc and commoncrawldump fail after crawl?

I have successfully crawled a website using Nutch and now I want to create a WARC from the results. However, running both the warc and commoncrawldump commands fails. Running bin/nutch dump -segment .... on the same segment folder works successfully.

I am using Nutch 1.17 and running:

bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments
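
and, for the WARC export, something along these lines (a sketch of how I was invoking it; <segment_name> is a placeholder for one of the timestamped segment directories, and you should check the usage printed by bin/nutch warc with no arguments for the exact form in your version):

bin/nutch warc output/warc crawl/segments/<segment_name>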

The error in hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/, despite having just run a crawl there.

Inside the segments folder were segments from a previous crawl that were throwing the error. They did not contain all of the segment data, as I believe that crawl was cancelled/finished early. This caused the entire process to fail. Deleting all of those files and starting anew fixed the issue.
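
If anyone hits the same problem, a quick way to spot incomplete segments before running commoncrawldump or warc is to check that each segment directory contains the usual subdirectories. This is only a rough shell sketch (crawl/segments is the path from my setup, and the subdirectory list assumes a normal fetch with parsing enabled):

for seg in crawl/segments/*/; do
  # a fully written segment normally has all six of these subdirectories
  for sub in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
    [ -d "$seg$sub" ] || echo "incomplete: $seg (missing $sub)"
  done
done

Any segment it reports as incomplete can be deleted (or re-crawled) before running the dump.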
