Nutch does not index into Elasticsearch correctly when using MongoDB

I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I have followed a mix of this tutorial:

https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch

and this tutorial:

http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

in order to create a web crawling tool using those three pieces of software.

Everything works great until it comes down to indexing. As soon as I use the index command from Nutch:

# bin/nutch index elasticsearch -all

this happens:

IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port (default 9300)
        elastic.index : elastic index command
        elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

IndexingJob: done.

My nutch-site.xml:

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>AOssama Crawler</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>aossama</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
</configuration>

I also looked into the ElasticIndexWriter.java code and noticed, near line 250, the class that calls the ElasticIndexWriter. I'm digging into that further now, but I'm completely lost as to why this isn't working with Mongo. I'm about to give up and try HBase, as much as I dislike it.

Thanks!

Joe

After a lot of trouble I got it working. I ended up using ES 1.4.4, Nutch 2.3.1, MongoDB 3.10, and JDK 8.

Many of the issues I went through remained unanswered in a number of other threads:

  • (This is an easy one, but...) MAKE SURE EVERYTHING IS RUNNING. Make sure Elasticsearch is running on the correct machine with the correct port, and that you can talk to it. Make sure MongoDB is up and running on the correct port, and that you can talk to it.
  • Use the correct index command. For Nutch 2.3.1 it's: ./bin/nutch index -all (after you fetch and parse). If you run into a Solr error, you do not have the correct index function (for example, indexer-elastic in plugin.includes) in your nutch-site.xml.
  • Name your crawler engine (that is, the cluster name, elastic.cluster in nutch-site.xml) the SAME THING in your elasticsearch.yml and your nutch-site.xml. This was huge. It is the main reason I had any error thrown in my index function.
  • Versioning. I tried to do this with the newer versions of Elasticsearch and frequently ran into problems. I am going to attempt to build this on the newest versions of Elasticsearch and Mongo and get back to this thread. Try the same build I used first, then attempt the other builds. The Elasticsearch version seems to matter most with Nutch, because of the Gora dependencies in the ivy/ivy.xml settings as well as the indexer-elastic/plugin.xml settings.
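
The first and third points above can be checked from a shell. The following is a minimal sketch and not part of the original setup: the helper names (port_open, es_cluster_name, nutch_cluster_name, cluster_names_match) and the file paths in the usage comments are made up for illustration. It assumes bash (for /dev/tcp), that the name to match is the Elasticsearch cluster name (cluster.name in elasticsearch.yml, elastic.cluster in nutch-site.xml), and that the <value> line directly follows the <name> line, as in the nutch-site.xml shown in the question.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file paths are illustrative assumptions.

# Succeed if a TCP connection to host:port can be opened.
# Uses bash's /dev/tcp pseudo-device; there is no timeout, so a
# firewalled (filtered) port may make this wait for a long time.
port_open() {
  local host=$1 port=$2
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Print the cluster.name value from an elasticsearch.yml file.
# Only handles a plain, uncommented "cluster.name: value" line.
es_cluster_name() {
  sed -n '/^[[:space:]]*cluster\.name:/{
    s/^[[:space:]]*cluster\.name:[[:space:]]*//
    s/[[:space:]]*$//
    p
  }' "$1" | tr -d "\"'" | head -n 1
}

# Print the elastic.cluster value from a nutch-site.xml file.
# Assumes <value> sits on the line right after <name>elastic.cluster</name>.
nutch_cluster_name() {
  grep -F -A1 '<name>elastic.cluster</name>' "$1" \
    | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p' | head -n 1
}

# Succeed only if both files define the same, non-empty cluster name.
cluster_names_match() {
  local es nutch
  es=$(es_cluster_name "$1")
  nutch=$(nutch_cluster_name "$2")
  [ -n "$es" ] && [ "$es" = "$nutch" ]
}

# Example usage (paths are placeholders; 9300 is the transport port the
# Nutch output lists as default, 27017 is MongoDB's default port):
#   port_open localhost 9300  || echo "Elasticsearch transport port not reachable"
#   port_open localhost 27017 || echo "MongoDB port not reachable"
#   cluster_names_match /path/to/elasticsearch.yml conf/nutch-site.xml \
#     || echo "cluster name mismatch"
```

A text-based comparison like this is deliberately crude; it only catches the common case of a mistyped or forgotten cluster name before an indexing run.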

Please, please, please, let me know if you're having any trouble with this. It took me close to two full weeks to figure this build out, and I know it can be incredibly frustrating. PM me or post here if you're running into issues; I'm sure I can help you work through them.

Joe

Nutch supports both Elasticsearch 2.2.0 and MongoDB, via the Gora plugin, in the branch named 2.x. For the Mongo backend, open $NUTCH_HOME/ivy/ivy.xml and make sure this dependency is enabled:

<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />

In addition, there is information on how to upgrade Elasticsearch in $NUTCH_HOME/src/plugin/indexer-elastic2/howto_upgrade_es.txt:

  1. Upgrade the Elasticsearch dependency in $NUTCH_HOME/src/plugin/indexer-elastic2/ivy.xml.

  2. Upgrade the Elasticsearch-specific dependencies in src/plugin/indexer-elastic2/plugin.xml. To get the list of dependencies and their versions, execute:

$ ant -f ./build-ivy.xml
$ ls lib | sed 's/^/      <library name="/g' | sed 's/$/"\/>/g'
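
To see what the sed step in that pipeline produces, it can be run on sample input instead of a real lib directory. This is only a sketch: the function name format_libraries and the jar file names are placeholders, not the actual contents of lib.

```shell
# Wrap each input line in a <library name="..."/> element, using the
# same two sed expressions as the command above.
format_libraries() {
  sed 's/^/      <library name="/g' | sed 's/$/"\/>/g'
}

# Placeholder jar names standing in for the output of `ls lib`.
printf '%s\n' example-one-1.0.jar example-two-2.0.jar | format_libraries
```

Each file name comes out as an indented <library name="..."/> line, ready to paste into the runtime section of plugin.xml.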
