I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I followed a mix of this tutorial:
https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch
and this tutorial:
http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
in order to build a web crawler from these three pieces of software.
Everything works great until it comes to indexing. As soon as I run the Nutch index command:
# bin/nutch index elasticsearch -all
this happens:
IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
IndexingJob: done.
My nutch-site.xml:
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>AOssama Crawler</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>aossama</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
</configuration>
I also looked into the ElasticIndexWriter.java code and noticed, near line 250, the class that calls the ElasticIndexWriter. I'm digging into that further now, but I'm completely lost as to why this isn't working with MongoDB. I'm about to give up and try HBase, as much as I dislike it.
Thanks!
Joe
After a lot of trouble I got it working. I ended up using ES 1.4.4, Nutch 2.3.1, MongoDB 3.10, and JDK 8.
Many of the issues I went through remained unanswered in a number of other threads:
./bin/nutch index -all
(after you fetch and parse). If you run into a Solr error, you do not have the correct index function in your nutch-site.xml. Please, please, please let me know if you're having any trouble with this. It took me close to two full weeks to figure this build out, and I know it can be incredibly frustrating. PM me or post here if you're running into issues; I'm sure I can help you work through them.
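As a rough illustration of what the "correct index function" might look like, here is a minimal sketch of the relevant nutch-site.xml properties. It assumes the Solr error comes from indexer-solr being active in plugin.includes instead of indexer-elastic. The plugin list is a trimmed-down assumption rather than a verified working set, the host, cluster, and index values are copied from the question, and the port is the default shown in the index command output.

```xml
<!-- nutch-site.xml (fragment). The plugin list is an illustrative,
     trimmed-down assumption; the point is that it names indexer-elastic
     and does not name indexer-solr. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic</value>
</property>
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <!-- 9300 is the default port reported by the index command output -->
  <name>elastic.port</name>
  <value>9300</value>
</property>
<property>
  <!-- must match the cluster name of the running Elasticsearch node -->
  <name>elastic.cluster</name>
  <value>aossama</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
```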
Joe
Nutch supports both Elasticsearch 2.2.0 and MongoDB (via the Gora plugin) in the branch named 2.x. For the MongoDB backend, enable this dependency in $NUTCH_HOME/ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
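The dependency only makes the MongoDB store available at build time; Nutch still has to be told to use it. A minimal sketch of the matching nutch-site.xml property, assuming the same store class that appears in the question's configuration (other Gora settings, for example in gora.properties, may also be needed depending on your setup):

```xml
<!-- nutch-site.xml (fragment): select the Gora MongoDB store as the
     storage backend. The class name is taken from the question's config. -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
</property>
```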
In addition, $NUTCH_HOME/src/plugin/indexer-elastic2/howto_upgrade_es.txt explains how to upgrade Elasticsearch:
Upgrade the elasticsearch dependency in $NUTCH_HOME/src/plugin/indexer-elastic2/ivy.xml.
Upgrade the Elasticsearch-specific dependencies in $NUTCH_HOME/src/plugin/indexer-elastic2/plugin.xml. To get the list of dependencies and their versions, execute:
$ ant -f ./build-ivy.xml
$ ls lib | sed 's/^/ <library name="/g' | sed 's/$/"\/>/g'
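The second command prints one `<library name="..."/>` line per jar found in lib. A sketch of how those lines might sit in the runtime section of plugin.xml follows. The jar names and versions are hypothetical placeholders, and the exact layout of the real plugin.xml is an assumption, so paste the lines your own build prints.

```xml
<!-- src/plugin/indexer-elastic2/plugin.xml (fragment).
     Jar names and versions below are hypothetical placeholders; use the
     names that the ls/sed command actually prints for your build. -->
<runtime>
  <library name="indexer-elastic2.jar">
    <export name="*"/>
  </library>
  <!-- one generated line per jar in lib/, for example: -->
  <library name="elasticsearch-2.2.0.jar"/>
  <library name="lucene-core-5.4.1.jar"/>
</runtime>
```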