
How to inject URLs found during crawl into the Nutch seed list

I have integrated Nutch 1.13 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I have given about 10 URLs in the seed list at /usr/local/apache-nutch-1.13/urls/seed.txt and followed the tutorial. The command I used is:

/usr/local/apache-nutch-1.13/bin/crawl -i -D solr.server.url=httpxxx:8983/solr/nutch/ /usr/local/apache-nutch-1.13/urls/ crawl 100

1. It seems to run for one or two hours, and I get corresponding results in Solr. But during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal output. Why aren't they being added to the seed list?

2. How do I know whether my crawldb is growing? It's been about a month and the only results I get in Solr are from the seed list and its links.

3. I have set the above command in crontab -e and in Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?

I'm a total newbie and any additional info would be helpful.

1. It seems to run for one or two hours, and I get corresponding results in Solr. But during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal output. Why aren't they being added to the seed list?

The seed file is never modified by Nutch; it only serves as read-only input for the injection phase. URLs discovered while fetching are added to the crawldb by the updatedb step, not to the seed file.
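As a minimal sketch of one crawl round using the plain nutch commands (the paths crawl/crawldb, crawl/segments and urls are example locations, not taken from your setup), the flow looks roughly like this:

# inject the seed URLs once; after this the seed file is only read, never written
bin/nutch inject crawl/crawldb urls

# one crawl round: select URLs to fetch, fetch and parse them,
# then merge the newly discovered links back into the crawldb
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"

The bin/crawl script you are already running performs these steps in a loop, so the URLs you see being fetched and parsed should end up in the crawldb even though seed.txt never changes.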

2. How do I know whether my crawldb is growing?

You should take a look at the readdb -stats option, where you should get something like this:

crawl.CrawlDbReader - Statistics for CrawlDb: test/crawldb
crawl.CrawlDbReader - TOTAL urls: 5584
crawl.CrawlDbReader - shortest fetch interval:    30 days, 00:00:00
crawl.CrawlDbReader - avg fetch interval: 30 days, 01:14:16
crawl.CrawlDbReader - longest fetch interval:     42 days, 00:00:00
crawl.CrawlDbReader - earliest fetch time:        Tue Nov 07 09:50:00 CET 2017
crawl.CrawlDbReader - avg of fetch times: Tue Nov 14 11:26:00 CET 2017
crawl.CrawlDbReader - latest fetch time:  Tue Dec 19 09:45:00 CET 2017
crawl.CrawlDbReader - retry 0:    5584
crawl.CrawlDbReader - min score:  0.0
crawl.CrawlDbReader - avg score:  5.463825E-4
crawl.CrawlDbReader - max score:  1.013
crawl.CrawlDbReader - status 1 (db_unfetched):    4278
crawl.CrawlDbReader - status 2 (db_fetched):      1014
crawl.CrawlDbReader - status 4 (db_redir_temp):   116
crawl.CrawlDbReader - status 5 (db_redir_perm):   19
crawl.CrawlDbReader - status 6 (db_notmodified):  24

A good trick I always do is to put this command inside the crawl script provided by Nutch (bin/crawl), inside the loop:

for ((a=1; ; a++))
do
...
  # print crawldb statistics at the end of each crawl iteration
  echo "stats"
  __bin_nutch readdb "$CRAWL_PATH"/crawldb -stats
done

It's been about a month and the only results I get in Solr are from the seed list and its links.

There can be multiple causes; you should check the output of each phase and see how the funnel narrows.
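A rough sketch of how you might inspect each phase (again assuming your crawl data lives under crawl/, which is just an example path):

# overall crawldb picture: fetched vs. unfetched vs. gone
bin/nutch readdb crawl/crawldb -stats

# per-segment counts, e.g. how many URLs were generated, fetched and parsed
bin/nutch readseg -list -dir crawl/segments

Typical culprits when the numbers stop growing are the URL filters (conf/regex-urlfilter.txt), robots.txt restrictions, the db.ignore.external.links setting, or a topN / fetch-list size that is too small.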

3. I have set the above command in crontab -e and in Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?

I guess you've used the default Nutch Solr schema; check the url vs. id fields. As far as I've worked with it, id is the unique identifier of a URL (which may contain redirects).
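If the duplicates come from the same content being indexed under several URLs, Nutch also ships a deduplication job you can run between crawls. A minimal sketch, assuming your crawldb is at crawl/crawldb and reusing the solr.server.url from your crawl command:

# mark entries with duplicate signatures in the crawldb as duplicates
bin/nutch dedup crawl/crawldb

# remove documents marked as duplicate (or gone) from the Solr index
bin/nutch clean -D solr.server.url=httpxxx:8983/solr/nutch/ crawl/crawldb

If I recall correctly, the bin/crawl script in 1.13 already runs the dedup step inside its loop, so the main thing to verify is that the id field is configured as the unique key in your Solr schema.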
