Nutch Crawl Script

I'm running Nutch 1.10 and having trouble using the crawl script provided by the Nutch developers:

Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
    -i|--index      Indexes crawl results into a configured indexer
    -D              A Java property to pass to Nutch calls
    Seed Dir        Directory in which to look for a seeds file
    Crawl Dir       Directory where the crawl/link/segments dirs are saved
    Num Rounds      The number of rounds to run this crawl for
 Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/  2

I was wondering if anyone could give me some insight into reading this. For example:

    -i|--index      **What is the configured indexer? Is this part of Nutch? Or is it another program like Solr? When I put in -i, what am I doing?**
    -D              **Not sure how these get used in the crawl but the instruction is pretty self-explanatory.**
    Seed Dir        **Self-explanatory, but where do I put the directory within Nutch? I created a urls directory (per the instructions) in the apache-nutch-1.10 directory. I've also tried putting it in the apache-nutch-1.10/bin directory because that is where the crawl starts from.**
    Crawl Dir       **Is this where the results of the crawl go, or is this where the data for the injection to the crawldb goes? If it's the latter, where do I get said data? The directory starts out empty and never gets filled. Confusing!**
    Num Rounds      **Self-explanatory**

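Other questions: Where do the crawl results go? Do they have to go to a Solr core (or some other software)? Can they just go to a directory so I can look at them? What format do they come out in?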

Thanks!

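-i : The configured indexer is a program such as Solr or Elasticsearch. When you specify the -i option, the crawl script runs the indexing job; otherwise it skips it.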
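To make the difference concrete, here is a minimal sketch of the two invocations. It assumes you run it from the Nutch install directory (apache-nutch-1.10 in the question) and reuses the seed directory, crawl directory, and Solr URL from the usage example; the variable names are only for illustration.

```shell
# Values taken from the usage example in the question.
SEED_DIR="urls/"
CRAWL_DIR="TestCrawl/"
ROUNDS=2

# Without -i: the crawl runs, but the indexing job is skipped.
CRAWL_ONLY="bin/crawl $SEED_DIR $CRAWL_DIR $ROUNDS"

# With -i: the same crawl, plus an indexing job sent to the configured indexer
# (here a Solr instance, passed as a Java property with -D).
CRAWL_AND_INDEX="bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ $SEED_DIR $CRAWL_DIR $ROUNDS"

# Print both commands so they can be copied and run from the install directory.
echo "$CRAWL_ONLY"
echo "$CRAWL_AND_INDEX"
```

Either way the crawl data itself is written to the crawl directory; -i only controls whether it is also pushed to the indexer.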

Crawl Dir : This is the directory where the crawl data is stored. It includes the crawldb, segments, and linkdb, so basically all crawl-related data goes here.
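If the crawl directory seems to stay empty, a quick check like the one below can show which of those three subdirectories exist so far. This is only a sketch: `check_crawl_dir` is a hypothetical helper name, and TestCrawl is the crawl directory from the question's example.

```shell
# Hypothetical helper: report which expected subdirectories exist under a crawl dir.
check_crawl_dir() {
    crawl_dir="$1"
    for sub in crawldb linkdb segments; do
        if [ -d "$crawl_dir/$sub" ]; then
            echo "$sub: present"
        else
            echo "$sub: missing"
        fi
    done
}

# Crawl directory name taken from the usage example in the question.
check_crawl_dir "TestCrawl"
```

If all three are reported missing after a run, the crawl most likely failed early, so the script's console output is the place to look.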

The crawl results go into the Crawl Dir you specify. They are stored as sequence files, and there are commands for viewing the data.

You can find them at https://wiki.apache.org/nutch/CommandLineOptions.
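As a rough sketch of what those viewing commands look like in Nutch 1.x, the script below dumps the crawldb, one segment, and the linkdb to plain text. It assumes it is run from the Nutch install directory and that the crawl directory is TestCrawl, as in the question's example; the output directory names (dump_crawldb, dump_segment, dump_linkdb) are arbitrary and should not already exist.

```shell
# Crawl directory name taken from the usage example in the question.
CRAWL_DIR="TestCrawl"

if [ -x bin/nutch ] && [ -d "$CRAWL_DIR/crawldb" ]; then
    # Summary statistics for the crawldb (URL counts by status).
    bin/nutch readdb "$CRAWL_DIR/crawldb" -stats

    # Dump the crawldb as plain text.
    bin/nutch readdb "$CRAWL_DIR/crawldb" -dump dump_crawldb

    # Dump the first segment found as plain text.
    SEGMENT=$(ls -d "$CRAWL_DIR"/segments/* | head -n 1)
    bin/nutch readseg -dump "$SEGMENT" dump_segment

    # Dump the linkdb (inlinks per URL) as plain text.
    bin/nutch readlinkdb "$CRAWL_DIR/linkdb" -dump dump_linkdb
else
    echo "bin/nutch or $CRAWL_DIR/crawldb not found; run a crawl from the Nutch install directory first"
fi
```

The dump directories contain ordinary text files, so the results can be inspected without Solr or any other indexer.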
