
Nutch crawl command

Original post 2013-10-25 14:07:44 3 1 solr/ web-crawler/ nutch

For Nutch 2.2.1, I am aware of two crawl commands: bin/nutch (step by step) and bin/crawl (all in one).

I know how to specify a crawl ID for the bin/crawl command. How do I specify a crawl ID for the bin/nutch command?

The reason I am asking is that I ran a large crawl job using the all-in-one "bin/crawl" command with a crawl ID, and it broke while indexing into Solr during the 9th crawl iteration. Now I want to run just the single "bin/nutch solrindex" step for that interrupted 9th iteration to complete the Solr indexing. How should I specify the crawl ID in the "bin/nutch solrindex" command? What is the syntax?

All the crawl data is stored in an HBase table, "webpage_test".

1 answer

You can run bin/nutch solrindex and pass the crawl and segments folders as parameters.

Nutch will index all documents again, but this will not create duplicates, because the ID field is used to determine whether a document has already been inserted.
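Nutch 2.x keeps pages in a storage backend such as HBase rather than in segment folders, and the bin/crawl script shipped with 2.x passes the crawl ID to the indexing step through a -crawlId option. The sketch below follows that pattern. It is an untested illustration, not a verified recipe for 2.2.1: the Solr URL and the crawl ID "test" are placeholder assumptions, and -all (index every stored page for that crawl ID) is used instead of a specific batch ID because the batch ID of the interrupted iteration is not known here. Run bin/nutch solrindex without arguments to see the usage line for your version before relying on it.

```shell
#!/bin/sh
# Placeholder values (assumptions): substitute your own Solr URL and crawl ID.
SOLR_URL="http://localhost:8983/solr/"
CRAWL_ID="test"

# Build the indexing command: index all stored pages for this crawl ID.
CMD="bin/nutch solrindex $SOLR_URL -all -crawlId $CRAWL_ID"

# Dry run: only print the command so it can be reviewed first.
# To execute it, run the printed line from the Nutch runtime directory.
echo "$CMD"
```

Since documents are keyed by the ID field, re-indexing pages that already reached Solr before the failure should overwrite them rather than duplicate them, as the answer above notes.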



solution1 1 2013-10-25 16:13:24
