Apache Nutch: indexing to Solr via REST
I'm new to Apache Nutch and am writing a client to use it via REST. All the steps (INJECT, FETCH, ...) succeed, but the last step, indexing to Solr, fails to pass the parameter. The request (formatted for readability):
{
"args": {
"batch": "1463743197862",
"crawlId": "sample-crawl-01",
"solr.server.url": "http:\/\/x.x.x.x:8081\/solr\/"
},
"confId": "default",
"type": "INDEX",
"crawlId": "sample-crawl-01"
}
The Nutch logs:
java.lang.Exception: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Has passing parameters to the Solr plugin been implemented?
You need to create or update a configuration using the /config/create/ endpoint, with a POST request and a payload similar to:
{
"configId":"solr-config",
"force":"true",
"params":{"solr.server.url":"http://127.0.0.1:8983/solr/"}
}
In this case I'm creating a new configuration and specifying the solr.server.url parameter. You can verify that this worked with a GET request to /config/solr-config (solr-config is the configId specified above). The output should contain all the default parameters; see https://gist.github.com/jorgelbg/689b1d66d116fa55a1ee14d7193d71b4 for an example of the default output. If everything worked, the returned JSON should include the solr.server.url option with the desired value: https://gist.github.com/jorgelbg/689b1d66d116fa55a1ee14d7193d71b4#file-nutch-solr-config-json-L464
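For a client written in Python, the create-then-verify steps could look roughly like the sketch below. It uses only the standard library. The base URL http://localhost:8081 is an assumption about where the Nutch REST server listens, the helper names are hypothetical, and the check assumes the GET response is a flat JSON object mapping property names to values, as in the linked gist.

```python
import json
import urllib.request

# Assumption: address of the Nutch REST server; adjust to your setup.
NUTCH_URL = "http://localhost:8081"


def config_payload(config_id, solr_url):
    """Build the body for POST /config/create/ (same shape as the payload above)."""
    return {
        "configId": config_id,
        "force": "true",
        "params": {"solr.server.url": solr_url},
    }


def post_json(path, payload, base_url=NUTCH_URL):
    """POST a JSON body to the server and return the raw response text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def get_json(path, base_url=NUTCH_URL):
    """GET a path from the server and parse the response as JSON."""
    with urllib.request.urlopen(base_url.rstrip("/") + path) as response:
        return json.loads(response.read().decode("utf-8"))


def solr_url_is_set(config, expected_url):
    """Check a fetched configuration (assumed to be a flat dict) for the Solr URL."""
    return config.get("solr.server.url") == expected_url


def create_and_verify(config_id, solr_url):
    """Create the configuration, then read it back and check solr.server.url."""
    post_json("/config/create/", config_payload(config_id, solr_url))
    return solr_url_is_set(get_json("/config/" + config_id), solr_url)
```

Calling create_and_verify("solr-config", "http://127.0.0.1:8983/solr/") against a running server should return True if the configuration was stored as expected.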
After this, hit the /job/create endpoint to create a new INDEX job. The payload should be something like:
{
"type":"INDEX",
"confId":"solr-config",
"crawlId":"crawl01",
"args": {}
}
The idea is that you need to pass the configId of the configuration you created with solr.server.url specified (in the confId field), along with the crawlId and any other args. This should return something similar to:
{
"id": "crawl01-solr-config-INDEX-1252914231",
"type": "INDEX",
"confId": "solr-config",
"args": {},
"result": null,
"state": "RUNNING",
"msg": "OK",
"crawlId": "crawl01"
}
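From a Python client, submitting the INDEX job might be sketched as follows. As before, the base URL http://localhost:8081 is an assumption, the helper names are hypothetical, and the response is assumed to be a JSON object like the one shown above.

```python
import json
import urllib.request

# Assumption: address of the Nutch REST server; adjust to your setup.
NUTCH_URL = "http://localhost:8081"


def index_job_payload(conf_id, crawl_id, args=None):
    """Build the body for POST /job/create for an INDEX job.

    The Solr URL is not passed here; it comes from the configuration
    referenced by conf_id.
    """
    return {
        "type": "INDEX",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": args if args is not None else {},
    }


def create_job(payload, base_url=NUTCH_URL):
    """POST the job payload and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running server, create_job(index_job_payload("solr-config", "crawl01")) should return a dictionary whose "state" and "msg" fields can be inspected as in the sample response.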
Bottom line: you need to create a new configuration with solr.server.url set, instead of specifying it through the args key in the JSON payload.