Why does the Apache Nutch clean job fail with Apache Solr in cloud mode?

I'm trying to set up Apache Nutch 1.15 with Apache Solr 7.6.0 in cloud mode. The crawl script (nutch/bin/crawl) works fine until the cleaning job (CleaningJob.java) starts. Then it fails without giving a reason (reason: NA).

I have successfully set up the same versions of Nutch and Solr with Solr in standalone mode.

I start Solr in cloud mode with the following commands:

solr/bin/solr start -cloud -p 8983 -s "solr/cloud/node1/solr"
solr/bin/solr start -cloud -p 7574 -s "solr/cloud/node2/solr" -z localhost:9983

And I start the crawl with the following command:

nutch/bin/crawl -i -s nutch/urls/ --num-threads 400 --hostdbupdate --hostdbgenerate --num-tasks 16 --sitemaps-from-hostdb once niche-crawl 8

It fails on the cleaning job:

nutch/bin/nutch clean niche-crawl/crawldb

With this exception:

No exchange was configured. The documents will be routed to all index writers.
SolrIndexer: deleting 1000/1000 documents
SolrIndexer: deleting 1000/1000 documents
ERROR CleaningJob: java.lang.RuntimeException: CleaningJob did not succeed, job status:FAILED, reason: NA

        at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:169)
        at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)

Here is my index-writers.xml for Solr in cloud mode:

    <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
      <parameters>
        <param name="type" value="cloud"/>
        <param name="url" value="http://localhost:8983/solr"/>
        <param name="collection" value="nutch"/>
        <param name="weight.field" value=""/>
        <param name="commitSize" value="1000"/>
        <param name="auth" value="true"/>
        <param name="username" value="solr"/>
        <param name="password" value="password"/>
      </parameters>
      <mapping>
        <copy>
          <!-- <field source="content" dest="search"/> -->
          <!-- <field source="title" dest="title,search"/> -->
        </copy>
        <rename>
          <field source="metatag.description" dest="description"/>
          <field source="metatag.keywords" dest="keywords"/>
        </rename>
        <remove>
          <field source="segment"/>
        </remove>
      </mapping>
    </writer>

Try upgrading to Nutch 1.16. This sounds like a known bug, https://issues.apache.org/jira/browse/NUTCH-2731, which is fixed in 1.16; see https://apache.org/dist/nutch/1.16/CHANGES.txt
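
To confirm that you are hitting that bug before upgrading, it may help to look at the full exception, since the console only prints "reason: NA". The sketch below assumes Nutch runs in local mode and writes its log to nutch/logs/hadoop.log (the usual default, but check your own setup). The function name show_nutch_errors is made up for this example and is not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): print ERROR/exception lines from a
# Nutch log together with the lines that follow them, so the stack trace
# behind "reason: NA" is visible.
# Assumption: the log lives at nutch/logs/hadoop.log unless a path is given.
show_nutch_errors() {
  log_file="${1:-nutch/logs/hadoop.log}"

  if [ ! -f "$log_file" ]; then
    echo "log file not found: $log_file" >&2
    return 1
  fi

  # -n prefixes line numbers, -A 20 keeps 20 lines of context after each
  # match; tail limits the output to the most recent 100 lines.
  grep -n -E -A 20 'ERROR|Exception' "$log_file" | tail -n 100
}
```

Running show_nutch_errors nutch/logs/hadoop.log right after the failed nutch/bin/nutch clean call should show the most recent errors at the bottom of the output, which you can then compare with the description in the JIRA issue.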
