
Nutch crawl command

Original question posted 2013-10-25 14:07:44. Tags: solr, web-crawler, nutch

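For Nutch 2.2.1, I know of two crawl commands: bin/nutch (step by step) and bin/crawl (all in one).

I know how to specify a crawl ID for the bin/crawl command. How do I specify a crawl ID for the bin/nutch command in the same way?

The reason I ask is that I ran a large crawl job with the all-in-one "bin/crawl" command, specifying a crawl ID, and it broke while indexing the 9th crawl iteration into Solr. Now I just want to run a single "bin/nutch solrindex" command for the interrupted 9th iteration to finish the Solr indexing. How do I specify the crawlID in the "bin/nutch solrindex" command? What is the syntax?

I am storing all the crawl data in the HBase table "webpage_test".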

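1 Answer

You can run bin/nutch solrindex and pass the crawl and segment folders as arguments.

Nutch will index all the documents again, but it will not create duplicates, because it uses the ID field to determine whether a document has already been inserted.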
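
For the HBase-backed setup in the question, the crawl data lives in the storage backend rather than in segment folders, so the crawl is normally selected by its ID. The sketch below assumes the Nutch 2.x usage `bin/nutch solrindex <solr url> (<batchId> | -all) [-crawlId <id>]`; that usage line is an assumption to verify by running `bin/nutch solrindex` with no arguments, which prints the expected syntax. The Solr URL, the crawl ID, and the helper name `solrindex_cmd` are placeholders and do not come from the original post.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions): replace with your own Solr endpoint
# and the crawl ID that was originally passed to bin/crawl.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CRAWL_ID="${CRAWL_ID:-my_crawl}"

# Hypothetical helper: builds the indexing command as a string.
# Assumed Nutch 2.x syntax: bin/nutch solrindex <solr url> -all -crawlId <id>
solrindex_cmd() {
  printf '%s\n' "bin/nutch solrindex $1 -all -crawlId $2"
}

# Print the command so it can be reviewed before it is run by hand.
solrindex_cmd "$SOLR_URL" "$CRAWL_ID"
```

Printing the command instead of executing it keeps the sketch safe to run anywhere. Once the syntax is confirmed against your installation, run the printed line from the Nutch runtime directory.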


