
How to configure Nutch to crawl only the URLs in the seed list? (no crawl back needed)

I have a URL seed list containing more than 100,000 URLs. I know that Nutch will crawl not only the URLs in the seed list but also any URL links found inside those websites. However, I would like to know whether there is any way to stop this behavior, so that only the URLs specified in the seed list are crawled.

In your nutch-site.xml configuration, set the "db.ignore.external.links" property to true.

This will ignore any URLs pointing to domains outside the injected list.
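For reference, a minimal nutch-site.xml fragment setting this property might look like the following (the file layout is the standard Hadoop-style configuration format; only the property name and value come from the answer above):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Ignore outlinks pointing to hosts other than the
    hosts of the injected seed URLs.</description>
  </property>
</configuration>
```

Note that this only keeps the crawl within the seed hosts; Nutch will still follow internal links on those sites, so combine it with a crawl depth of 1 if you want strictly the seed URLs.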

If you are using the crawl command, check the depth parameter.

-depth <depth> indicates the link depth from the root page that should be crawled.

Using this you can control how deep you need Nutch to crawl. A value of 1 would likely limit the crawl to the seed pages only.
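As a sketch, using the legacy one-step crawl command (Nutch 1.x; the directory names and -topN value here are illustrative, not from the original post):

```shell
# Fetch only the pages listed in the urls/ seed directory:
# -depth 1 means a single round, so no discovered outlinks are followed.
bin/nutch crawl urls -dir crawl -depth 1 -topN 100000
```

In newer Nutch releases the one-step crawl command was replaced by the bin/crawl script, which takes the number of rounds as an argument instead of -depth.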
