How to crawl and parse only precise data using Nutch?

Original post: 2015-09-24 09:44:04, tags: java, parsing, solr, web-crawler, nutch

I'm new to Nutch and crawling. I installed Nutch 2.0, then crawled and indexed data with Solr 4.5 by following some basic tutorials. Now I don't want to parse all of a page's text content. I want to customize Nutch so that it crawls the page and extracts only address-related data, because my use case is to crawl URLs and parse only the address information as text.

For example, I need to crawl and parse only the text content that contains address information, email ID, phone number and fax number.

How should I do this? Is there a plugin already available for this?
If I need to write a custom parser for this, can anyone help me with that?

2 Answers

Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch. The alternative is to write a custom HtmlParseFilter that scrapes the data you want. A good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch, and you're working with 2.x. Things differ to some degree, but the logic should be portable. The other alternative is to use the 1.x branch.

Based on your comment:

Since you don't know the structure of the webpage, the problem is somewhat different. Essentially you'll need to "teach" Nutch how to detect the text you want, either with regular expressions or with a library that extracts addresses from plain text, such as the jgeocoder library. You'll need to parse the page (iterating over every node) and look for anything that resembles an address, phone number, fax number, etc. This is similar to what the headings plugin does, except that the headings plugin finds the title nodes in the HTML structure rather than looking for addresses or phone numbers. It could be a starting point for writing a plugin that does what you want, but I don't think anything does this out of the box.
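The node-by-node scan described above could be sketched roughly as follows. This is not Nutch plugin code: it uses only the JDK's DOM and regex classes, and the class name ContactScanner, the two patterns and the sample page are all made up for illustration. Inside a real HtmlParseFilter the DOM would come from Nutch rather than from the parse helper here, and street addresses would need something stronger than a regex (for example the jgeocoder library mentioned above).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: walks a DOM tree and collects
// anything in the text nodes that looks like an email address or phone number.
class ContactScanner {
    // Deliberately loose, illustrative patterns (assumptions, not a standard).
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d\\s().-]{6,}\\d");

    // Recursively visit every node; only text nodes are matched against the patterns.
    static void walk(Node node, List<String> emails, List<String> phones) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            collect(EMAIL, text, emails);
            collect(PHONE, text, phones);
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), emails, phones);
        }
    }

    // Append every match of the pattern in the given text to the output list.
    static void collect(Pattern pattern, String text, List<String> out) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group().trim());
        }
    }

    // Stand-in for the DOM that Nutch would supply; only accepts well-formed markup.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Contact</h1>"
            + "<p>Mail: info@example.com</p>"
            + "<div>Phone: +1 (555) 010-2030</div></body></html>";
        List<String> emails = new ArrayList<>();
        List<String> phones = new ArrayList<>();
        walk(parse(page), emails, phones);
        System.out.println("emails: " + emails);
        System.out.println("phones: " + phones);
    }
}
```

The sketch does not tell a phone number from a fax number. Doing that would mean also looking at the label text around each match (such as "Fax:"), which is left out here.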

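Check [NUTCH-978], which introduces a plugin called XPath. It lets Nutch users process various kinds of web pages and fetch only the specific information they need, which makes the index more accurate and its content more flexible.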
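To give an idea of what XPath-based extraction looks like, here is a minimal sketch that uses the JDK's own javax.xml.xpath API. It is not the NUTCH-978 plugin or its configuration. The class name XPathFieldExtractor, the sample markup and the expression //div[@class='address'] are assumptions chosen for illustration, and the approach only works when the page structure is known in advance.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: returns the trimmed text of every
// node selected by an XPath expression.
class XPathFieldExtractor {
    static List<String> extract(String xhtml, String expression) {
        try {
            // Parse the (well-formed) markup into a DOM document.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            // Evaluate the expression and ask for a node set as the result.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<div class=\"nav\">Home</div>"
            + "<div class=\"address\">12 Example Street, Springfield</div>"
            + "</body></html>";
        // Only the address block is selected; the navigation text is ignored.
        System.out.println(extract(page, "//div[@class='address']"));
    }
}
```

Real-world HTML is rarely well-formed XML, so the parse step above would fail on many pages. Inside Nutch, the expression would be evaluated against the DOM that the HTML parser has already built.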