
How to crawl and parse only precise data using Nutch?

I'm new to Nutch and crawling. I have installed Nutch 2.0 and crawled and indexed data using Solr 4.5 by following some basic tutorials. Now I don't want to parse all the text content of a page. I want to customize Nutch so that it crawls a page and extracts only the address-related data, because my use case is to crawl URLs and parse only address information as text.

For example, I need to crawl and parse only the text content that contains address information, email addresses, phone numbers and fax numbers.

  1. How should I do this? Is there a plugin already available for it?
  2. If I need to write a custom parser for this, can anyone help me with it?

Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch. The alternative is to write a custom HtmlParseFilter that scrapes the data you want. A good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch and you're working with 2.x; although things differ to some degree, the logic should be portable. The other alternative is to use the 1.x branch.
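To give a feel for the XPath approach, here is a minimal sketch that uses only the JDK's `javax.xml.xpath` API, not the Nutch plugin itself. The class name `XPathFieldExtractor`, the `id="address"` attribute and the sample page are all made up for illustration, and the input is assumed to be well-formed XHTML (real pages usually are not, so inside Nutch you would apply the expression to the DOM the HTML parser has already built).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: pulls one field out of a page via XPath.
class XPathFieldExtractor {

    // Evaluates the expression against well-formed XHTML and returns the
    // string value of the first matching node (empty string if none matches).
    static String extract(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad expression or malformed input", e);
        }
    }

    public static void main(String[] args) {
        // Assumed page layout: the address lives in <div id="address">.
        String page = "<html><body><div id=\"address\">12 Main St, Springfield</div>"
                + "<p>Other text</p></body></html>";
        System.out.println(extract(page, "//div[@id='address']"));
    }
}
```

This only works when you know where the data sits in the page, which is exactly what an XPath-based plugin relies on.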

Based on your comment:

Since you don't know the structure of the web page, the problem is somewhat different. Essentially you'll need to "teach" Nutch how to detect the text you want, either with regular expressions or with a library that extracts addresses from plain text, such as jgeocoder. You'll need to walk the page (iterate over every node) looking for anything that resembles an address, phone number, fax number, etc. This is similar to what the headings plugin does, except that it finds the title nodes in the HTML structure rather than addresses or phone numbers. That could be a starting point for writing a plugin that does what you want, but I don't think there is anything that does this out of the box.
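The node-by-node idea can be sketched roughly like this, again with plain JDK classes rather than the Nutch API. The class name `ContactExtractor` and both regular expressions are assumptions for illustration only: the patterns are deliberately simple, will miss many real-world formats, and do not attempt street addresses at all (that is where something like jgeocoder would come in). The input is assumed to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: walks a DOM and collects contact details.
class ContactExtractor {
    // Illustrative patterns only; tune them for the pages you actually crawl.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern PHONE =
            Pattern.compile("\\+?\\d[\\d\\s().-]{6,}\\d");

    // Parses the page and returns matches grouped under "email" and "phone".
    static Map<String, List<String>> extract(String xhtml) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("email", new ArrayList<>());
        out.put("phone", new ArrayList<>());
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            walk(root, out);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        return out;
    }

    // Depth-first traversal: test every text node against each pattern.
    static void walk(Node node, Map<String, List<String>> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            collect(EMAIL, text, out.get("email"));
            collect(PHONE, text, out.get("phone"));
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), out);
        }
    }

    static void collect(Pattern pattern, String text, List<String> sink) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            sink.add(m.group().trim());
        }
    }
}
```

Telling a fax number apart from a phone number would need extra logic, for example checking whether a label such as "Fax" appears just before the match. In a real plugin the same traversal would run over the DOM that Nutch passes to the parse filter, and the matches would be stored as parse metadata for indexing.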

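Check out [NUTCH-978], which introduced a plugin named XPath that allows Nutch users to process various web pages and fetch only the specific information they need, making the index more accurate and its content more flexible.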
