
Indexing a structure in Solr with Apache Nutch

On a second-hand car seller website there are thousands of car ads. This is a typical ad -> alfa-romeo

If I crawl all these ad pages, for all the different cars, I index all this useless text that I don't want. I would like to crawl just something like

the title, the description, the kilometres on the car, and the power in CV (hp), not the whole page.

I'm using Nutch since it has good integration with Solr, but Nutch is built to crawl everything, and among its plugins I didn't find a good one to solve my problem.

I already tried nutch-custom-search, but it didn't work.

Do you know something to solve my problem? I just want to crawl the pages of a specific website, and only specific parts of those pages, and index them to Solr.

Maybe another crawler with good integration with Solr?

Thanks

Take a look also at https://issues.apache.org/jira/browse/NUTCH-1870, which is an XPath plugin for Nutch; it will allow you to extract the desired elements of a webpage and store them in individual fields.
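The core idea of that plugin is binding one XPath expression to each Solr field. As a minimal, self-contained sketch of that idea using only the JDK's built-in XPath API (the markup and the ad-title/ad-km/ad-hp class names are hypothetical, and this is not the plugin's actual configuration format):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class AdFieldExtractor {
    public static void main(String[] args) throws Exception {
        // A tiny well-formed stand-in for one ad page; the real page's
        // markup and class names would differ.
        String html = "<html><body>"
                + "<h1 class=\"ad-title\">Alfa Romeo Giulietta</h1>"
                + "<span class=\"ad-km\">120000</span>"
                + "<span class=\"ad-hp\">105</span>"
                + "</body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

        // One XPath expression per target field, mirroring what the
        // plugin lets you configure declaratively.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String title = xpath.evaluate("//h1[@class='ad-title']/text()", doc);
        String km    = xpath.evaluate("//span[@class='ad-km']/text()", doc);
        String hp    = xpath.evaluate("//span[@class='ad-hp']/text()", doc);

        System.out.println("title=" + title + " km=" + km + " hp=" + hp);
    }
}
```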

If you're willing to take a look at a different crawler, have a look at https://github.com/DigitalPebble/storm-crawler/, which is a set of resources for building your own crawler on top of Apache Storm. The main gain with this approach is that it is a near-real-time (NRT) crawler.
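Whichever crawler you choose, what lands in Solr should look the same: one document per ad, with one field per extracted value rather than the whole page body. For reference, a minimal SolrJ sketch of indexing such a document (assuming a local Solr core named "cars"; the field names mirror the hypothetical ones above):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AdIndexer {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr core named "cars" running locally; adjust the URL
        // and field names to match your schema.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/cars").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "ad-12345");                 // hypothetical ad id
            doc.addField("title", "Alfa Romeo Giulietta");
            doc.addField("description", "Well kept, one owner.");
            doc.addField("km", 120000);
            doc.addField("hp", 105);
            solr.add(doc);
            solr.commit();
        }
    }
}
```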

