
How to define the coverage of my nutch crawl?

I've been collecting/crawling a website over the last two weeks. I've used the crawl command setting 100 iterations. The process has just finished. How can I know the coverage of the data crawled? I really don't expect an exact number, but I'd really like to know approximately how much information remains un-crawled in the website.

Your question is a bit ambiguous. If you're trying to find out how much of the entire website you've already crawled, this is a hard problem: Nutch has no idea how big or small the website(s) you're crawling are. You said that you ran 100 iterations. With the default settings in the bin/crawl script, this means that on each iteration Nutch fetches a maximum of 50,000 URLs ( https://github.com/apache/nutch/blob/master/src/bin/crawl#L117 ). That doesn't mean your website has no more URLs than that; it is just a Nutch configuration limit, and Nutch may not even have discovered all the URLs yet. On each iteration Nutch can discover new URLs, which makes the process incremental.

What you can do is execute the bin/nutch readdb command passing the -stats parameter, something like:

$ bin/nutch readdb crawl/crawldb -stats

This should produce output similar to:

CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 575
retry 0:    569
retry 1:    6
min score:  0.0
avg score:  0.0069252173
max score:  1.049
status 1 (db_unfetched):    391
status 2 (db_fetched):  129
status 3 (db_gone): 53
status 4 (db_redir_temp):   1
status 5 (db_redir_perm):   1
CrawlDb statistics: done

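With this information you can see the total number of URLs discovered and how many of them have been fetched, along with some other useful details. In the sample above, for instance, 129 of the 575 known URLs have been fetched and 391 are still unfetched. Keep in mind that these numbers only describe URLs Nutch has already discovered, not the full size of the website.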
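As a rough illustration, the numbers in that output can be turned into a "processed so far" ratio. The following is only a minimal sketch in Python: the function names `parse_crawldb_stats` and `summarize` are made up for this example, the sample text is copied from the output above, and the parsing assumes the stats lines look like the ones shown (real output may carry log prefixes, which the regular expressions tolerate). The ratio is relative to the URLs Nutch has discovered, not to the real size of the site.

```python
import re

# Sample lines copied from the -stats output above. In practice you would
# capture the output of `bin/nutch readdb crawl/crawldb -stats` instead.
SAMPLE_STATS = """\
TOTAL urls: 575
status 1 (db_unfetched):    391
status 2 (db_fetched):  129
status 3 (db_gone): 53
status 4 (db_redir_temp):   1
status 5 (db_redir_perm):   1
"""


def parse_crawldb_stats(text):
    """Return (total_urls, {status_name: count}) parsed from -stats output."""
    total = None
    statuses = {}
    for line in text.splitlines():
        match = re.search(r"TOTAL urls:\s*(\d+)", line)
        if match:
            total = int(match.group(1))
            continue
        match = re.search(r"status \d+ \((\w+)\):\s*(\d+)", line)
        if match:
            statuses[match.group(1)] = int(match.group(2))
    return total, statuses


def summarize(total, statuses):
    """Summarize how many of the *discovered* URLs are no longer unfetched."""
    unfetched = statuses.get("db_unfetched", 0)
    fetched = statuses.get("db_fetched", 0)
    processed = total - unfetched  # fetched, gone, redirected, ...
    return {
        "total": total,
        "unfetched": unfetched,
        "processed": processed,
        "processed_fraction": processed / total if total else 0.0,
        "fetched_fraction": fetched / total if total else 0.0,
    }


if __name__ == "__main__":
    total, statuses = parse_crawldb_stats(SAMPLE_STATS)
    for key, value in summarize(total, statuses).items():
        print(f"{key}: {value}")
```

If the unfetched count keeps growing from one iteration to the next, Nutch is still discovering new URLs faster than it fetches them, so the crawl is probably far from complete; if it shrinks toward zero, most of what Nutch can reach has likely been visited.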

Thanks, @Jorge. Based on what you've said:

Nutch has no idea of how big/small is the website(s) you're crawling

So, there's no way to calculate that unless you know the size of the website in advance.

Thanks, again.
