
Why does the Apache Nutch clean job fail with Apache Solr in cloud mode?

I'm trying to set up Apache Nutch 1.15 with Apache Solr 7.6.0 in cloud mode. The crawl script (nutch/bin/crawl) works fine until the cleaning job (CleaningJob.java) starts; then it fails without giving a reason (reason: NA).

I have successfully set up the same versions of Nutch and Solr with Solr in standalone mode.

I am starting Solr in cloud mode with the following commands:

solr/bin/solr start -cloud -p 8983 -s "solr/cloud/node1/solr"
solr/bin/solr start -cloud -p 7574 -s "solr/cloud/node2/solr" -z localhost:9983

And I am starting the crawling process with the following command:

nutch/bin/crawl -i -s nutch/urls/ --num-threads 400 --hostdbupdate --hostdbgenerate --num-tasks 16 --sitemaps-from-hostdb once niche-crawl 8

It fails on the cleaning job:

nutch/bin/nutch clean niche-crawl/crawldb

With an exception:

No exchange was configured. The documents will be routed to all index writers.
SolrIndexer: deleting 1000/1000 documents
SolrIndexer: deleting 1000/1000 documents
ERROR CleaningJob: java.lang.RuntimeException: CleaningJob did not succeed, job status:FAILED, reason: NA

        at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:169)
        at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)
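
Because the console only reports "reason: NA", the underlying exception usually has to be read from the Nutch log file. The following is a minimal sketch, assuming Nutch runs in local mode and writes its log to nutch/logs/hadoop.log (the default location relative to the Nutch directory used above). The function name show_clean_errors is hypothetical, and the grep pattern is just one reasonable choice.

```shell
#!/usr/bin/env bash
# Print ERROR/exception entries from the Nutch log together with the
# stack-trace lines that follow them.
# Assumption: local-mode logs live in <nutch dir>/logs/hadoop.log.
show_clean_errors() {
  local log_file="${1:-nutch/logs/hadoop.log}"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 1
  fi
  # -A 15 keeps up to 15 lines after each match (the stack trace);
  # tail limits the output to the most recent 80 lines.
  grep -n -A 15 -E 'ERROR|Exception' "$log_file" | tail -n 80
}
```

Calling show_clean_errors with no argument reads nutch/logs/hadoop.log; pass another path if the log is kept elsewhere.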

Here is my index-writers.xml for Solr in cloud mode:

      <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
        <parameters>
          <param name="type" value="cloud"/>
          <param name="url" value="http://localhost:8983/solr"/>
          <param name="collection" value="nutch"/>
          <param name="weight.field" value=""/>
          <param name="commitSize" value="1000"/>
          <param name="auth" value="true"/>
          <param name="username" value="solr"/>
          <param name="password" value="password"/>
        </parameters>
        <mapping>
          <copy>
            <!-- <field source="content" dest="search"/> -->
            <!-- <field source="title" dest="title,search"/> -->
          </copy>
          <rename>
            <field source="metatag.description" dest="description"/>
            <field source="metatag.keywords" dest="keywords"/>
          </rename>
          <remove>
            <field source="segment"/>
          </remove>
        </mapping>
      </writer>

Try upgrading to Nutch 1.16. This sounds like a known bug, NUTCH-2731 (https://issues.apache.org/jira/browse/NUTCH-2731), which is fixed in 1.16; see https://apache.org/dist/nutch/1.16/CHANGES.txt
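
To confirm which Nutch release is actually installed, before and after the upgrade, one option is to read the version from the core jar name. This is a minimal sketch, assuming the binary distribution layout in which the core jar is named lib/apache-nutch-<version>.jar under the Nutch directory. The function name nutch_version is hypothetical.

```shell
#!/usr/bin/env bash
# Print the Nutch version inferred from the core jar name in <nutch dir>/lib.
# Assumption: the jar is named apache-nutch-<version>.jar.
nutch_version() {
  local nutch_home="${1:-nutch}"
  local jar
  for jar in "$nutch_home"/lib/apache-nutch-*.jar; do
    # If the glob matched nothing, the literal pattern comes back; skip it.
    [ -e "$jar" ] || continue
    jar="${jar##*/}"              # strip the directory part
    jar="${jar#apache-nutch-}"    # strip the name prefix
    printf '%s\n' "${jar%.jar}"   # strip the .jar suffix, leaving the version
    return 0
  done
  echo "no apache-nutch-*.jar found under $nutch_home/lib" >&2
  return 1
}
```

Running nutch_version nutch against the setup described in the question should report 1.15 if the layout assumption holds, and 1.16 once the upgrade is in place.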

