
How to inject URLs found during a crawl into the Nutch seed list

I have integrated Nutch 1.13 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I put about 10 URLs in the seed list at /usr/local/apache-nutch-1.13/urls/seed.txt and followed the tutorial. The command I used is:

/usr/local/apache-nutch-1.13/bin/crawl -i -D solr.server.url=httpxxx:8983/solr/nutch/ /usr/local/apache-nutch-1.13/urls/ crawl 100

1. It seems to run for one or two hours, and I get corresponding results in Solr. But during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal output. Why aren't they being added to the seed list?

2. How do I know whether my crawldb is growing? It's been about a month and the only results I get in Solr are from the seed list and its links.

3. I have set the above command in crontab -e and in Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?

I'm a total newbie and any additional info would be helpful.

1. It seems to run for one or two hours, and I get corresponding results in Solr. But during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal output. Why aren't they being added to the seed list?

The seed file is never modified by Nutch; it is only read during the injection phase. URLs discovered while crawling end up in the CrawlDb, not in the seed file.
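
If you want to add more start URLs later, you can re-run the inject step by hand against the existing CrawlDb instead of editing the seed file during a crawl. A minimal sketch, assuming the crawl/ directory created by the command above and an illustrative extra seed directory named urls_extra:

# Put the new start URLs into a separate seed directory (the name is just an example)
mkdir -p /usr/local/apache-nutch-1.13/urls_extra
echo "http://example.com/" > /usr/local/apache-nutch-1.13/urls_extra/seed.txt

# Inject them into the existing CrawlDb; the original seed.txt stays untouched
/usr/local/apache-nutch-1.13/bin/nutch inject crawl/crawldb /usr/local/apache-nutch-1.13/urls_extra/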

2. How do I know whether my crawldb is growing?

You should take a look at the readdb -stats option; you should get something like this:

crawl.CrawlDbReader - Statistics for CrawlDb: test/crawldb
crawl.CrawlDbReader - TOTAL urls: 5584
crawl.CrawlDbReader - shortest fetch interval:    30 days, 00:00:00
crawl.CrawlDbReader - avg fetch interval: 30 days, 01:14:16
crawl.CrawlDbReader - longest fetch interval:     42 days, 00:00:00
crawl.CrawlDbReader - earliest fetch time:        Tue Nov 07 09:50:00 CET 2017
crawl.CrawlDbReader - avg of fetch times: Tue Nov 14 11:26:00 CET 2017
crawl.CrawlDbReader - latest fetch time:  Tue Dec 19 09:45:00 CET 2017
crawl.CrawlDbReader - retry 0:    5584
crawl.CrawlDbReader - min score:  0.0
crawl.CrawlDbReader - avg score:  5.463825E-4
crawl.CrawlDbReader - max score:  1.013
crawl.CrawlDbReader - status 1 (db_unfetched):    4278
crawl.CrawlDbReader - status 2 (db_fetched):      1014
crawl.CrawlDbReader - status 4 (db_redir_temp):   116
crawl.CrawlDbReader - status 5 (db_redir_perm):   19
crawl.CrawlDbReader - status 6 (db_notmodified):  24
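
To get this output for your own crawl, run readdb against your CrawlDb; a minimal sketch, assuming the crawl/ directory used by the command in the question:

# Print CrawlDb statistics for the crawl directory used above
/usr/local/apache-nutch-1.13/bin/nutch readdb crawl/crawldb -stats

If TOTAL urls and db_fetched keep increasing between runs, the CrawlDb is growing.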

A good trick I always use is to put this command inside the crawl script provided by Nutch (bin/crawl), inside the loop:

for ((a=1; ; a++))
do
...
  echo "stats"
  __bin_nutch readdb "$CRAWL_PATH"/crawldb -stats
done
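
Each round then prints the CrawlDb statistics into the crawl output, so you can compare TOTAL urls and db_fetched between iterations and see whether the crawl is still discovering and fetching new pages.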

It's been about a month and the only results I get in Solr are from the seed list and its links.

There can be multiple causes; you should check the output of each phase (generate, fetch, parse, update, index) and see how the funnel narrows.
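
As a rough sketch of how to inspect that funnel (paths assume the crawl/ layout produced by bin/crawl):

# How many URLs are known, fetched, unfetched, redirected?
/usr/local/apache-nutch-1.13/bin/nutch readdb crawl/crawldb -stats

# Per-segment overview of generated / fetched / parsed counts
/usr/local/apache-nutch-1.13/bin/nutch readseg -list -dir crawl/segments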

3. I have set the above command in crontab -e and in Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?

I guess you've used the default Nutch Solr schema; check the url vs. id fields. As far as I've worked with it, id is the unique identifier of a URL (which may contain redirects).
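
If I remember correctly, the bin/crawl script already runs a deduplication step, but you can also run it (and the index cleaning job) by hand. A sketch, reusing the crawldb and Solr URL from the command in the question:

# Mark documents with identical content signatures as duplicates in the CrawlDb
/usr/local/apache-nutch-1.13/bin/nutch dedup crawl/crawldb

# Delete documents marked as duplicate (or gone) from the Solr index
/usr/local/apache-nutch-1.13/bin/nutch clean -D solr.server.url=httpxxx:8983/solr/nutch/ crawl/crawldb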
