Nutch does not index on Elasticsearch correctly using MongoDB
I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I have followed a mix of this tutorial:
https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch
and this tutorial:
http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
in order to create a web crawling tool using those three pieces of software.
Everything works great until it comes down to indexing. As soon as I use the index command from Nutch:
# bin/nutch index elasticsearch -all
this happens:
IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
IndexingJob: done.
My nutch-site.xml:
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>AOssama Crawler</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
<name>elastic.host</name>
<value>localhost</value>
</property>
<property>
<name>elastic.cluster</name>
<value>aossama</value>
</property>
<property>
<name>elastic.index</name>
<value>nutch</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
<name>http.content.limit</name>
<value>6553600</value>
</property>
</configuration>
I also looked into the ElasticIndexWriter.java code and noticed, near line 250, the class that calls the ElasticIndexWriter. I'm digging into that further now, but I'm completely lost as to why this isn't working with Mongo. I'm about to give up and try with HBase, as much as I dislike it.
Thanks!
Joe
After a lot of trouble I got it working. I ended up using ES 1.4.4, Nutch 2.3.1, MongoDB 3.10, and JDK 8.
Many of the issues I went through remained unanswered in a number of other threads:
For Nutch 2.3.1 the index command is:
./bin/nutch index -all
(after you fetch and parse).
If you run into a solr error, you do not have the correct index function in your nutch-site.xml.
Please let me know if you're having any trouble with this. It took me close to 2 full weeks to figure this build out and I know it can be incredibly frustrating. PM me or post on this if you're running into issues; I'm sure I can help you work through them.
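Since the index step only has something to send after fetch and parse have run, it may help to see one full round in order. This is a rough sketch for Nutch 2.x, not the exact commands from my setup: the seed directory name urls and the -topN 10 value are assumptions, and NUTCH can be set to echo for a dry run that only prints each step.

```shell
#!/bin/sh
# One Nutch 2.x crawl round, ending with the index step.
# NUTCH defaults to ./bin/nutch; set NUTCH=echo to print the steps instead of running them.
NUTCH="${NUTCH:-./bin/nutch}"
# Assumed seed directory name (illustrative, not from the post).
SEED_DIR="${SEED_DIR:-urls}"

crawl_round() {
  # Each step runs only if the previous one succeeded.
  "$NUTCH" inject "$SEED_DIR" &&
  "$NUTCH" generate -topN 10 &&   # -topN 10 is an arbitrary small batch size
  "$NUTCH" fetch -all &&
  "$NUTCH" parse -all &&
  "$NUTCH" updatedb -all &&
  "$NUTCH" index -all             # nothing is indexed unless fetch and parse ran first
}
```

Calling crawl_round from $NUTCH_HOME runs the round; with NUTCH=echo it just lists the six steps so you can check the order before touching the crawl database.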
Joe
The Nutch 2.x branch supports both Elasticsearch 2.2.0 and MongoDB (via the Gora plugin). For the MongoDB backend, enable the following dependency in $NUTCH_HOME/ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
In addition, instructions for upgrading Elasticsearch are in $NUTCH_HOME/src/plugin/indexer-elastic2/howto_upgrade_es.txt:
Upgrade the elasticsearch dependency in $NUTCH_HOME/src/plugin/indexer-elastic2/ivy.xml
Upgrade the Elasticsearch-specific dependencies in src/plugin/indexer-elastic2/plugin.xml. To get the list of dependencies and their versions, execute:
$ ant -f ./build-ivy.xml
$ ls lib | sed 's/^/ <library name="/g' | sed 's/$/"\/>/g'
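To see what that ls/sed pipeline produces without running the ivy build, here is a small sketch that runs it against a throwaway directory. The two jar names are made up for illustration; the real list comes from whatever the build places in lib.

```shell
# Throwaway directory with hypothetical jar names, only to show the output shape.
tmp="$(mktemp -d)"
mkdir "$tmp/lib"
touch "$tmp/lib/elasticsearch-2.2.0.jar" "$tmp/lib/lucene-core-5.4.1.jar"

# Same pipeline as above: prefix each file name with the opening of a
# <library> element, then append the closing of the element.
entries="$(cd "$tmp" && ls lib | sed 's/^/ <library name="/g' | sed 's/$/"\/>/g')"
printf '%s\n' "$entries"

# Clean up the throwaway directory.
rm -rf "$tmp"
```

Each printed line is a library element that can be pasted into the runtime section of plugin.xml in place of the old Elasticsearch entries.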