Nutch crawl command

For Nutch 2.2.1, I am aware of two crawl commands: bin/nutch (step by step) and bin/crawl (all in one).

I know how to specify a crawl ID for the bin/crawl command. Similarly, how do I specify a crawl ID for the bin/nutch command?

The reason I am asking is that I ran a large crawl job using the all-in-one bin/crawl command with a crawl ID, and it broke while indexing into Solr during the 9th crawl iteration. Now I want to run only the single step "bin/nutch solrindex" for that interrupted 9th iteration, to complete the Solr indexing. How should I specify the crawl ID in the "bin/nutch solrindex" command? What is the syntax?

I have all the crawl data stored in an HBase table named "webpage_test".

You can run bin/nutch solrindex and pass the crawl and segments folders as parameters.

Nutch will index all documents but will not create duplicates, because it uses the ID field to determine whether a document has already been inserted.
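
The answer above describes passing crawl and segments folders. For the HBase-backed 2.x setup in the question, where there are no segment folders, a crawl ID flag may be the closer fit. The sketch below assumes the Nutch 2.x usage is `solrindex <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]`. That usage line is recalled from the 2.x scripts rather than confirmed by this thread, so check it against your install first (running `bin/nutch solrindex` with no arguments should print the usage). The Solr URL and the crawl ID `my_crawl` are placeholders, not values from the question.

```shell
#!/bin/sh
# Hypothetical values: replace with your own Solr endpoint and crawl ID.
SOLR_URL="http://localhost:8983/solr"
CRAWL_ID="my_crawl"

# Assumed Nutch 2.x argument order: <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
# -all asks the indexer to consider every row stored under this crawl ID.
INDEX_CMD="bin/nutch solrindex $SOLR_URL -all -crawlId $CRAWL_ID"

# Dry run: print the command so it can be checked before running it
# from the Nutch runtime directory.
echo "$INDEX_CMD"
```

Because indexing is keyed on the document ID, re-sending documents that were already indexed before the failure should overwrite them rather than duplicate them.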
