I'm running Nutch 1.10 and having trouble using the crawl script provided by the Nutch developers:
Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
-i|--index Indexes crawl results into a configured indexer
-D A Java property to pass to Nutch calls
Seed Dir Directory in which to look for a seeds file
Crawl Dir Directory where the crawl/link/segments dirs are saved
Num Rounds The number of rounds to run this crawl for
Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
I was wondering if anyone can give me some insight into reading this. For instance:
-i|--index **What is the configured indexer? Is this part of Nutch, or is it another program like Solr? When I put in -i, what am I doing?**
-D **Not sure how these get used in the crawl but the instruction is pretty self-explanatory.**
Seed Dir **Self-explanatory, but where do I put the directory within Nutch? I created a urls directory (per the instructions) in the apache-nutch-1.10 directory. I've also tried putting it in the apache-nutch-1.10/bin directory because that is where the crawl starts from.**
Crawl Dir **Is this where the results of the crawl go, or is this where the data for the injection to the crawldb goes? If it's the latter, where do I get said data? The directory starts out empty and never gets filled. Confusing!**
Num Rounds **Self-explanatory**
Other questions: Where do the results of the crawl go? Do they have to go to a Solr core (or some other piece of software)? Can they just go to a directory so I can look at them? What format do they come out in?
Thanks!
-i : The configured indexer is a search backend such as Solr or Elasticsearch. When you specify the -i option, the crawl script runs the index job; without it, the index job is skipped.
Crawl Dir : This is the directory where the crawl data is stored, including the crawldb, segments and linkdb. Basically, all the data relating to the crawl goes in here.
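As a rough sketch of how these options fit together, the commands below create a seed directory, run the crawl without -i, and list the crawl directory afterwards. The assumptions here are that you run it from the Nutch runtime directory (the one containing bin/), that relative paths resolve against that directory, and that the seed file is called seed.txt with the single URL shown; none of that comes from the question, so adjust as needed.

```shell
#!/bin/sh
# Sketch only: run from the Nutch runtime directory (the one that contains bin/).
# The file name seed.txt and the URL below are illustrative assumptions.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

if [ -x bin/crawl ]; then
    # Without -i the index job is skipped; results stay in TestCrawl/.
    bin/crawl urls/ TestCrawl/ 2

    # With -i the index job also runs (example taken from the usage text;
    # it needs a Solr instance listening at that URL):
    # bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2

    # The crawl directory should now contain crawldb, linkdb and segments.
    ls TestCrawl/
else
    echo "bin/crawl not found; run this from the Nutch runtime directory" >&2
fi
```

If TestCrawl/ is still empty after a run, the script's console output is the first place to look for why the inject or fetch step failed.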
The results of a crawl go into the Crawl Dir you specify. The data is stored as Hadoop sequence files, and there are commands to view it.
You can find them at https://wiki.apache.org/nutch/CommandLineOptions.
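For example, a minimal sketch of dumping the crawl data to plain text might look like this. It assumes the crawl directory is TestCrawl (as in the usage example) and writes into a dumps/ directory, which is a name chosen here for illustration.

```shell
#!/bin/sh
# Sketch only: run from the Nutch runtime directory after a crawl has finished.
# CRAWL_DIR and the dumps/ output location are illustrative assumptions.
CRAWL_DIR=TestCrawl
mkdir -p dumps

if [ -x bin/nutch ] && [ -d "$CRAWL_DIR/crawldb" ]; then
    # Print summary statistics (URL counts by status) for the crawldb.
    bin/nutch readdb "$CRAWL_DIR/crawldb" -stats

    # Dump the crawldb entries as readable text.
    bin/nutch readdb "$CRAWL_DIR/crawldb" -dump dumps/crawldb

    # Dump the link database (inlinks per URL).
    bin/nutch readlinkdb "$CRAWL_DIR/linkdb" -dump dumps/linkdb

    # Dump every segment (fetched content, parse text, and so on).
    for seg in "$CRAWL_DIR"/segments/*; do
        bin/nutch readseg -dump "$seg" "dumps/segment_$(basename "$seg")"
    done
else
    echo "No crawl data found in $CRAWL_DIR; run bin/crawl first" >&2
fi
```

The dumps are ordinary text files, so they can be opened in any editor or searched with grep.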