
Index crawled data from Apache Nutch using Elasticsearch?

I have Apache Nutch 1.7 and Elasticsearch 1.4.4 on an AWS EC2 Ubuntu instance. I crawled data using Nutch, but how can I index that data using Elasticsearch? No official documentation is available for this.

Enable the Elasticsearch indexer in the configuration by adding indexer-elastic to the plugin.includes property list, as shown below:

    <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

In your nutch-site.xml, add the following properties:

<property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

The above makes Elasticsearch the indexer. Next, specify the Elasticsearch host:

<property>
        <name>elastic.host</name>
        <value>localhost</value>
</property>

Other optional properties you can set include elastic.port, elastic.cluster, etc.
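As a rough sketch of those optional settings, a fragment like the following could sit next to elastic.host in nutch-site.xml. The values are assumptions for illustration: 9300 is Elasticsearch's default transport port (not the 9200 HTTP port), elasticsearch is the default cluster name, and elastic.index, which is not mentioned above, is assumed to be the property that names the target index.

```xml
<!-- Assumed values; adjust them to match your own cluster. -->
<property>
        <name>elastic.port</name>
        <!-- Transport port, assumed to be the Elasticsearch default -->
        <value>9300</value>
</property>
<property>
        <name>elastic.cluster</name>
        <!-- Must match cluster.name in elasticsearch.yml -->
        <value>elasticsearch</value>
</property>
<property>
        <name>elastic.index</name>
        <!-- Assumed index name that documents are written to -->
        <value>nutch</value>
</property>
```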

Since you mentioned that you have already crawled the data and now want to index it, you can use:

./bin/nutch index <crawldb> -dir <segment_dir>

This indexes all the crawled data residing in the segments. You can then check your Elasticsearch index for the documents.
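One way to do that check is through Elasticsearch's REST API with curl. The sketch below is only an illustration: the helper names (es_url, count_docs, sample_doc) are hypothetical, the index name nutch is an assumption, and 9200 is assumed to be the HTTP port on the same host as elastic.host.

```shell
#!/bin/sh
# Assumed values; override via environment variables if your setup differs.
ES_HOST="${ES_HOST:-localhost}"        # same machine as elastic.host
ES_HTTP_PORT="${ES_HTTP_PORT:-9200}"   # Elasticsearch default HTTP port
ES_INDEX="${ES_INDEX:-nutch}"          # assumed name of the index Nutch wrote to

# Build the REST URL for an endpoint under the index (hypothetical helper).
es_url() {
    printf 'http://%s:%s/%s/%s' "$ES_HOST" "$ES_HTTP_PORT" "$ES_INDEX" "$1"
}

# Print the number of documents currently in the index.
count_docs() {
    curl -s "$(es_url _count)?pretty"
}

# Print a single indexed document so its fields can be inspected.
sample_doc() {
    curl -s "$(es_url _search)?size=1&pretty"
}
```

After sourcing the snippet, run count_docs: a non-zero count in the response suggests the indexing step wrote documents. Running sample_doc returns one of them for a closer look at the fields.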
