
Nutch crawl command

For Nutch 2.2.1, I am aware of two crawl commands: bin/nutch (step by step) and bin/crawl (all in one).

I know how to specify a crawl ID for the bin/crawl command. How do I specify a crawl ID for the bin/nutch command?

The reason I am asking is that I ran a large crawl job with the all-in-one "bin/crawl" command, specifying a crawl ID, and it broke while indexing into Solr during the 9th crawl iteration. Now I want to run just the single "bin/nutch solrindex" step for that interrupted 9th iteration to complete the Solr indexing. How should I specify the crawl ID in the "bin/nutch solrindex" command? What is the syntax?

All of the crawl data is stored in an HBase table named "webpage_test".

You can run bin/nutch solrindex and pass the crawl and segments folders as parameters.

Nutch will index all documents again but will not create duplicates, because it uses the ID field to determine whether a document has already been inserted.
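For the HBase-backed 2.x setup described in the question, a rough sketch of the single-step invocation might look like the following. It assumes that solrindex in Nutch 2.x accepts a Solr URL, a batch selector such as -all, and a -crawlId option, which is how the 2.x bin/crawl script appears to call it; check the usage message printed by bin/nutch solrindex on your installation before relying on it. The Solr URL and crawl ID values are placeholders, not taken from the question.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace with your own.
SOLR_URL="http://localhost:8983/solr"   # hypothetical Solr endpoint
CRAWL_ID="test"                         # hypothetical crawl ID passed earlier to bin/crawl

# Assumed Nutch 2.x syntax: <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
# -all asks the indexer to process every batch stored for this crawl ID.
CMD="bin/nutch solrindex $SOLR_URL -all -crawlId $CRAWL_ID"

# Dry run: print the command so it can be reviewed first.
echo "$CMD"

# To actually index, run from the Nutch runtime directory:
# $CMD
```

Re-indexing everything with -all is slower than indexing only the interrupted batch, but it avoids having to look up the batch ID of the 9th iteration, and documents already in Solr are overwritten by ID rather than duplicated.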
