
Using Nutch to Retrieve Page Contents

Posted 2014-09-30 12:50:14 · Tags: java, web-crawler, nutch

I have a very large list of seeds to crawl (only those seed pages are needed, with no deeper crawling). How can I use Nutch to retrieve:

the HTML of
the text content of
(preferably) the out-links of

the seed pages, without any indexing or integration into another platform such as Solr?

Thanks

1 Answer

Well, there are several issues you want to address. Here they are, each with its solution:

Limiting crawling to the seed list: enable the scoring-depth plugin and configure it to allow only 1 level of crawling.
Getting textual content: Nutch does that by default.
Getting raw HTML: this is not possible with Nutch 1.9. You need to download Nutch from its trunk repository and build it, because raw HTML content is scheduled for Nutch's next release (1.10).
Extracting outlinks: you can do that, but you have to write a new indexingFilter to index the outlinks.
Doing all of the above without Solr: you can do that. However, you have to write a new indexer that stores the extracted data in whatever format you want.
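
For the first point, a rough sketch of what the override in conf/nutch-site.xml might look like. It assumes the scoring-depth plugin reads its limit from the scoring.depth.max property and that injected seeds start at depth 1, so a maximum of 1 keeps the crawl from following outlinks. The plugin.includes value is only illustrative, not a recommended list: keep whatever plugins you already use and add scoring-depth to them.

```xml
<!-- conf/nutch-site.xml (sketch: merge into your existing configuration) -->
<configuration>
  <!-- Assumption: illustrative plugin list only. Keep your own list and add scoring-depth. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-(opic|depth)|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- Assumption: seeds are injected at depth 1, so a maximum of 1 stops outlinks from being crawled. -->
  <property>
    <name>scoring.depth.max</name>
    <value>1</value>
  </property>
</configuration>
```

It is worth checking the property name and its default against the nutch-default.xml shipped with your build before relying on this.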