
Nutch crawl command

Posted 2013-10-25 14:07:44 | Tags: solr, web-crawler, nutch

For Nutch 2.2.1, I am aware of two crawl commands: bin/nutch (step by step) and bin/crawl (all in one).

I know how to specify a crawl ID for the bin/crawl command. How do I specify a crawl ID for the bin/nutch command?

The reason I am asking: I ran a large crawl job with the all-in-one bin/crawl command, specifying a crawl ID, and it broke while indexing into Solr during the 9th crawl iteration. Now I just want to run the single step "bin/nutch solrindex" for that interrupted 9th iteration to complete the Solr indexing. How should I specify the crawl ID in the "bin/nutch solrindex" command? What is the syntax?

All the crawl data is stored in an HBase table named "webpage_test".

1 Answer

You can run bin/nutch solrindex and pass the crawl and segments folders as parameters.

Nutch will index all documents but will not create duplicates, because it uses the ID field to determine whether a document has already been inserted.
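As a rough illustration of the command described in this answer, here is a dry-run sketch that only prints the command line rather than executing it. The Solr URL and the crawl folder name are hypothetical placeholders, and the argument order (Solr URL, crawldb, optional linkdb, then segments) is assumed from Nutch 1.x-style usage, so it may not match your installation.

```shell
#!/bin/sh
# Dry-run sketch: prints the indexing command instead of executing it.
# Both values below are hypothetical placeholders, not taken from the question.
SOLR_URL="http://localhost:8983/solr/"   # assumed Solr endpoint
CRAWL_DIR="crawl"                        # assumed crawl output folder

build_solrindex_cmd() {
    # Argument order assumed from Nutch 1.x-style usage:
    #   solrindex <solr url> <crawldb> -linkdb <linkdb> <segment> ...
    # The glob stays literal here because it is inside double quotes.
    echo "bin/nutch solrindex $1 $2/crawldb -linkdb $2/linkdb $2/segments/*"
}

CMD=$(build_solrindex_cmd "$SOLR_URL" "$CRAWL_DIR")
echo "$CMD"
```

Nutch 2.x with an HBase backend keeps its data in the store rather than in crawldb and segment folders, so these arguments may not apply there. Running bin/nutch solrindex with no arguments should print the usage for your version; the 2.x usage may include a crawl ID option such as -crawlId, but treat that as something to verify against the usage output rather than as confirmed here.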