Nutch Crawl Script

I'm running Nutch 1.10 and having trouble using the crawl script provided by the Nutch developers:

Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
    -i|--index      Indexes crawl results into a configured indexer
    -D              A Java property to pass to Nutch calls
    Seed Dir        Directory in which to look for a seeds file
    Crawl Dir       Directory where the crawl/link/segments dirs are saved
    Num Rounds      The number of rounds to run this crawl for
 Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/  2

I was wondering if anyone can give me some insight into reading this. For instance:

    -i|--index      **What is the configured indexer? Is this part of Nutch? Or is it another program like Solr? When I put in -i, what am I doing?**
    -D              **Not sure how these get used in the crawl but the instruction is pretty self-explanatory.**
    Seed Dir        **Self-explanatory, but where do I put the directory within Nutch? I created a urls directory (per the instructions) in the apache-nutch-1.10 directory. I've also tried putting it in the apache-nutch-1.10/bin directory because that is where the crawl starts from.**
    Crawl Dir       **Is this where the results of the crawl go, or is this where the data for the injection to the crawldb goes? If it's the latter, where do I get said data? The directory starts out empty and never gets filled. Confusing!**
    Num Rounds      **Self-explanatory**

Other questions: Where do the results of the crawl go? Do they have to go to a Solr core (or some other piece of software)? Can they just go to a directory so I can look at them? What format do they come out in?

Thanks!

-i: The configured indexer is a program like Solr or Elasticsearch. When you specify the -i option, the crawl script runs the index job; otherwise it skips it.

Crawl Dir: the directory where the crawl data is stored. This includes the crawldb, segments and linkdb, so all the data relating to the crawl goes in here.
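As a rough sketch of how the seed and crawl directories fit together: the snippet below assumes you run it from the apache-nutch-1.10 directory in local mode, so the relative paths urls/ and TestCrawl/ resolve against that directory. The file name seed.txt and the seed URL are only illustrative choices, and the crawl is started without -i, so the index job is skipped.

```shell
# Create the seed directory with one seed file, one URL per line.
# (seed.txt and the URL are illustrative; use your own.)
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Only attempt the crawl when the script is actually present,
# i.e. when this is run from the Nutch install directory.
if [ -x bin/crawl ]; then
  # No -i here: crawl two rounds and keep everything in TestCrawl/.
  bin/crawl urls/ TestCrawl/ 2

  # After a successful run the crawl dir should contain
  # crawldb, linkdb and segments.
  ls TestCrawl/
fi
```

If TestCrawl/ is still empty afterwards, the console output of bin/crawl is the first place to look, since a failed inject step leaves nothing behind.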

The results of a crawl go into the crawl dir you specify. They are stored as sequence files, and there are commands to view the data.

You can find them at https://wiki.apache.org/nutch/CommandLineOptions
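For example, the reader commands can be used roughly as follows to look at the data without any indexer. This is a sketch, assuming it is run from the Nutch install directory after a crawl into TestCrawl/. The CRAWL_DIR variable, the latest_segment helper and the *_dump output directory names are made up for illustration and are not part of Nutch.

```shell
# Crawl dir to inspect (hypothetical variable; defaults to TestCrawl).
CRAWL_DIR="${CRAWL_DIR:-TestCrawl}"

# Segment directories are named by timestamp, so the last entry
# in sorted order is the most recent one. (Hypothetical helper.)
latest_segment() {
  ls -d "$1"/segments/* 2>/dev/null | sort | tail -n 1
}

# Only run the readers if Nutch and a crawldb are actually there.
if [ -x bin/nutch ] && [ -d "$CRAWL_DIR/crawldb" ]; then
  # Summary statistics for the crawldb (URL counts by status).
  bin/nutch readdb "$CRAWL_DIR/crawldb" -stats

  # Text dumps of the crawldb and linkdb into new directories.
  bin/nutch readdb "$CRAWL_DIR/crawldb" -dump crawldb_dump
  bin/nutch readlinkdb "$CRAWL_DIR/linkdb" -dump linkdb_dump

  # Text dump of the newest segment (fetched content, parse data).
  bin/nutch readseg -dump "$(latest_segment "$CRAWL_DIR")" segment_dump
fi
```

The dump directories then contain text files you can open in any editor, which is the easiest way to see what the crawl actually fetched.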
