
Indexing a structure in Solr with Apache Nutch

Original post: 2016-08-02 14:31:14 | Tags: json, apache, solr, web-crawler, nutch

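A used-car dealer's website has thousands of car ads. Here is a typical ad -> alfa-romeo

If I crawl all of these ad pages, for all the different cars, I end up indexing a lot of useless text that I do not need. I only want to extract things like the title, the description, the car's mileage in km and its power in cv (hp), not the whole page.

I am using Nutch because it integrates well with Solr, but Nutch is built to crawl everything, and I have not found a good way to solve my problem with plugins.

I have already tried a custom search, and it did not help.

Do you know a way to solve this? I just want to crawl the pages of one specific website, extract specific parts of each page, and index them for search.

Maybe another crawler that integrates well with Solr?

Thanks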

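Also have a look at https://issues.apache.org/jira/browse/NUTCH-1870 , which is an XPath plugin for Nutch. It will let you extract the elements you need from a web page and store them in individual fields.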
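To illustrate the general idea of XPath-based field extraction (this is not the plugin's own configuration or API), here is a minimal Python sketch. The sample markup, the field names (`title`, `description`, `km`, `power_hp`) and the XPath expressions are all hypothetical, and the standard-library parser used here only handles well-formed markup, so a real ad page would likely need an HTML-tolerant parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one ad page.
SAMPLE_AD = """
<html>
  <body>
    <div id="header">Site navigation, banners, other noise</div>
    <div id="ad">
      <h1 class="title">Alfa Romeo Giulietta</h1>
      <p class="description">One owner, full service history.</p>
      <span class="km">85000</span>
      <span class="power">120</span>
    </div>
    <div id="footer">Legal text that should not be indexed</div>
  </body>
</html>
"""

# Field name -> XPath expression. Both sides are assumptions made for
# this example; adapt them to the markup of the site being crawled.
FIELD_XPATHS = {
    "title": ".//div[@id='ad']/h1[@class='title']",
    "description": ".//div[@id='ad']/p[@class='description']",
    "km": ".//div[@id='ad']/span[@class='km']",
    "power_hp": ".//div[@id='ad']/span[@class='power']",
}


def extract_fields(page, field_xpaths):
    """Return a dict holding only the fields matched by the given XPaths."""
    root = ET.fromstring(page.strip())
    doc = {}
    for field, xpath in field_xpaths.items():
        node = root.find(xpath)  # first matching element, or None
        if node is not None and node.text:
            doc[field] = node.text.strip()
    return doc


doc = extract_fields(SAMPLE_AD, FIELD_XPATHS)

# A JSON array of documents; the header and footer text never reaches it.
print(json.dumps([doc], indent=2))
```

Each extracted value lands in its own field, so the index holds only the ad data rather than the full page text. The field names would need to match whatever the target Solr schema defines.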

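If you are willing to look at other crawlers, see https://github.com/DigitalPebble/storm-crawler/ , a set of resources for building your own crawlers on top of Apache Storm. The main benefit of this approach is that you get an NRT (near-real-time) crawler.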
