
Apache Nutch: indexing to Solr via REST

I'm new to Apache Nutch and am writing a client that drives it via REST. All the steps (INJECT, FETCH, ...) succeed, but the last step, indexing to Solr, fails because the solr.server.url parameter is not passed through. The request (formatted with an online tool):

{
  "args": {
    "batch": "1463743197862",
    "crawlId": "sample-crawl-01",
    "solr.server.url": "http:\/\/x.x.x.x:8081\/solr\/"
  },
  "confId": "default",
  "type": "INDEX",
  "crawlId": "sample-crawl-01"
}

The Nutch logs:

java.lang.Exception: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)

Was passing parameters to the Solr plugin this way ever implemented?

You need to create/update a configuration using the /config/create/ endpoint, with a POST request and a payload similar to:

{
    "configId":"solr-config",
    "force":"true",
    "params":{"solr.server.url":"http://127.0.0.1:8983/solr/"}
}

In this case I'm creating a new configuration and specifying the solr.server.url parameter. You can verify that it worked with a GET request to /config/solr-config (solr-config is the configId specified above). The output should contain all the default parameters; see https://gist.github.com/jorgelbg/689b1d66d116fa55a1ee14d7193d71b4 for an example of the default output. If everything worked, the returned JSON should contain the solr.server.url option with the desired value: https://gist.github.com/jorgelbg/689b1d66d116fa55a1ee14d7193d71b4#file-nutch-solr-config-json-L464
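The create-then-verify step described here could be scripted roughly as follows. This is only a sketch: the base URL (localhost:8081) is an assumption about where the Nutch REST server listens, the function names are made up for illustration, and it assumes GET /config/<configId> returns a flat JSON object of property names to values, as the linked gist suggests.

```python
import json
import urllib.request

# Assumption: address of the Nutch REST server; adjust to your setup.
NUTCH_URL = "http://localhost:8081"


def build_config_request(config_id, params, base_url=NUTCH_URL):
    """Build the POST request that creates (or overwrites) a configuration."""
    payload = {"configId": config_id, "force": "true", "params": params}
    return urllib.request.Request(
        base_url + "/config/create/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def has_param(config, key, expected):
    """Check one key/value in the object returned by GET /config/<configId>."""
    return config.get(key) == expected


def create_and_verify(config_id, params, base_url=NUTCH_URL):
    """Send the create request, read the config back, and check every param."""
    urllib.request.urlopen(build_config_request(config_id, params, base_url)).close()
    with urllib.request.urlopen(base_url + "/config/" + config_id) as resp:
        config = json.load(resp)  # assumed to be a flat {property: value} object
    return all(has_param(config, k, v) for k, v in params.items())
```

Calling create_and_verify("solr-config", {"solr.server.url": "http://127.0.0.1:8983/solr/"}) against a running server should return True if the configuration was stored as expected.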

After this, just hit the /job/create endpoint to create a new INDEX job. The payload should be something like:

{
    "type":"INDEX",
    "confId":"solr-config",
    "crawlId":"crawl01",
    "args": {}
}

The idea is that you need to pass the configId that you created with solr.server.url specified (as confId), along with the crawlId and other args. This should return something similar to:

{
  "id": "crawl01-solr-config-INDEX-1252914231",
  "type": "INDEX",
  "confId": "solr-config",
  "args": {},
  "result": null,
  "state": "RUNNING",
  "msg": "OK",
  "crawlId": "crawl01"
}
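For a client, submitting the INDEX job and waiting for it might look like the sketch below. The base URL and the function names are assumptions for illustration, and so is polling GET /job/<id>: the answer only shows the response to /job/create, so the sketch assumes the job lookup returns the same JSON shape and simply waits until the state is no longer RUNNING.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: address of the Nutch REST server; adjust to your setup.
NUTCH_URL = "http://localhost:8081"


def build_index_job(conf_id, crawl_id):
    """Payload for an INDEX job; the Solr URL comes from the config, not args."""
    return {"type": "INDEX", "confId": conf_id, "crawlId": crawl_id, "args": {}}


def submit_job(job, base_url=NUTCH_URL):
    """POST the job to /job/create and return the parsed JSON response."""
    req = urllib.request.Request(
        base_url + "/job/create",
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(job_id, base_url=NUTCH_URL, poll_seconds=5):
    """Poll the job (assumed endpoint: GET /job/<id>) until it stops RUNNING."""
    url = base_url + "/job/" + urllib.parse.quote(job_id, safe="")
    while True:
        with urllib.request.urlopen(url) as resp:
            info = json.load(resp)
        if info.get("state") != "RUNNING":
            return info
        time.sleep(poll_seconds)
```

A typical call would be info = submit_job(build_index_job("solr-config", "crawl01")) followed by wait_for_job(info["id"]).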

Bottom line: you need to create a new configuration with solr.server.url set, instead of specifying it through the args key in the JSON payload.
