
Getting status of a Nutch crawl?

I've set up Nutch and gave it a seed list of URLs to crawl. I configured it such that it will not crawl anything outside of my seed list. The seed list contains ~1.5 million URLs. I followed the guide and kicked off Nutch like so:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s1 -addBinaryContent -base64

Aside: I really wish I knew how to crawl and index at the same time (e.g., crawl a page -> index it, crawl the next page), because I currently have to wait for this entire crawl to finish before anything is indexed at all.

Anyway, right now, from checking the hadoop.log, I believe I've crawled about 40k links in 48 hours. However, I'd like to make sure that it's grabbing all the content correctly. I'd also like to see which links have been crawled, and which links are left. I've read all the documentation and I can't seem to figure out how to get the status of a Nutch crawl unless it was started as a job.

I'm running Nutch 1.10 with Solr 4.10.

As of now, there is no way to see the status of a crawl while it is being fetched, apart from the log. You can query the crawldb only after the fetch-parse-updatedb jobs are over.

And I think you are missing the bin/nutch updatedb job before running bin/nutch solrindex.
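For reference, a minimal sketch of where that step would slot into your existing sequence (using the same $s1 segment variable as in your commands):

s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch parse $s1
# fold the results of this segment back into the crawldb before link inversion / indexing
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s1 -addBinaryContent -base64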

As you have mentioned, it seems like you are not using the ./bin/crawl script but calling each job individually.

For crawls as large as yours, one way I can think of is to use the ./bin/crawl script, which by default generates 50k URLs for fetching per iteration. After every iteration you could use the:

./bin/nutch readdb <crawl_db> -stats

command given at https://wiki.apache.org/nutch/CommandLineOptions to check the crawldb status.
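For example, you could run one round at a time and inspect the crawldb between invocations (the argument order below is what the 1.10-era bin/crawl script expects; check the usage message of your copy, since it changed in later releases):

# one round: generate up to 50k urls, fetch, parse, updatedb, invertlinks and index into Solr
bin/crawl urls crawl http://127.0.0.1:8983/solr/ 1

# the stats output includes db_unfetched / db_fetched counts, i.e. how much of the seed list is left
bin/nutch readdb crawl/crawldb -stats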

If you want to check updates more frequently, then change (lower) the '-topN' parameter (which is passed to the generate job) in the ./bin/crawl script. By varying the number of iterations you should then be able to crawl your entire seed list.
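If you keep calling the individual jobs yourself instead, the same effect comes from capping the generate job directly (the 10k value below is just an illustrative choice):

# cap the next segment at 10k urls so each fetch/parse round finishes sooner
bin/nutch generate crawl/crawldb crawl/segments -topN 10000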

Hope this helps :)
