Indexing Nutch crawled data in "Bluemix" Solr

Question

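I am trying to index Nutch-crawled data with Bluemix Solr, but I cannot find a way to do it. My main question is: can anyone help me with this? What should I do to send the results of my Nutch crawl to my Bluemix Solr? For crawling I used Nutch 1.11. Below is part of what I have done so far and the problems I ran into. I think there may be two possible solutions:

Via the nutch command:

NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/ -Dsolr.server.url="OURSOLRURL"

I can index the Nutch-crawled data with OURSOLR. However, I found some problems.

a- Strange as it sounds, it would not accept the URL. I could work around that by URL-encoding it.

b- Since I have to connect with a specific username and password, Nutch could not connect to my Solr. Considering this: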

 Active IndexWriters :
 SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

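from the command-line output, I tried to handle this problem by passing the authentication parameters to the command: solr.auth=true solr.auth.username="SOLR-UserName" solr.auth.password="Pass".

So up to now I have arrived at this command:

bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"

But for some reason I have not worked out yet, the command treats the authentication parameters as crawled data directories and does not work. So I guess this is not the right way to set the "Active IndexWriters" parameters. Can anyone tell me how I can do it?

Via the curl command:

curl -X POST -H "Content-Type: application/json" -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" --data-binary @{/path_to_file}/FILE.json

I thought maybe I could feed it the JSON files created by this command:

bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename -jsonArray -reverseKey

But there are some problems here.

a. This command produces so many files in complicated paths that POSTing them all manually would take a lot of time. I guess for big crawls it might even be impossible. Is there any way to POST all the files in a directory and its subdirectories at once with a single command?

b. There is a weird name "ÙÙ÷yœ" at the start of the JSON files created by commoncrawldump.

c. I removed the weird name and tried to POST just one of these files, but this is the result: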

 {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown command 'url' at [9]","code":400}}

Does this mean these files cannot be fed to Bluemix Solr and are all useless to me?

Answer 1

Thanks to Lewis John Mcgibbney, I realized that the index tool should be used as follows:

bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb Crawl/segments/2016*

Meaning: put -D before each of the auth parameters, and place these parameters right before the tool's arguments.
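
The ordering rule above can be sketched in Python, the language used in the other answer. This is only a minimal sketch, not part of the original answer: the helper name `build_index_command` is hypothetical, and the URL, credentials, and paths are the placeholders from the command above. Because the arguments are passed as a list rather than through a shell, the `2016*` wildcard is expanded with `glob` instead.

```python
import glob
import subprocess


def build_index_command(solr_url, username, password, crawldb, linkdb, segments):
    """Build the `bin/nutch index` argument list (hypothetical helper).

    Each property is passed as `-D name=value` and placed before the
    tool's own arguments, as described above.
    """
    properties = [
        ("solr.server.url", solr_url),
        ("solr.auth", "true"),
        ("solr.auth.username", username),
        ("solr.auth.password", password),
    ]
    cmd = ["bin/nutch", "index"]
    for name, value in properties:
        cmd += ["-D", "%s=%s" % (name, value)]
    # Tool arguments come last: crawldb, -linkdb <dir>, then the segment dirs.
    cmd += [crawldb, "-linkdb", linkdb] + list(segments)
    return cmd


if __name__ == "__main__":
    cmd = build_index_command(
        "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections",
        "USERNAME",
        "PASS",
        "Crawl/crawldb",
        "Crawl/linkdb",
        sorted(glob.glob("Crawl/segments/2016*")),
    )
    print(" ".join(cmd))
    # subprocess.call(cmd)  # uncomment to run it from the Nutch home directory
```

Printing the list first makes it easy to check that no property ended up after the crawl directories, which is what went wrong in the question.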

Answer 2

To index Nutch-crawled data in the Bluemix Retrieve and Rank service, one should:

Crawl the seeds with Nutch, e.g.

$: bin/crawl -w 5 urls crawl 25

You can check the status of the crawl:

bin/nutch readdb crawl/crawldb/ -stats

Dump the crawled data as files:

$: bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/

Post those that can be posted directly, e.g. XML files, to the Solr collection on Retrieve and Rank:

 Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' % (solr_cluster_id, solr_collection_name)
 cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' % (Cont_type_xml, solr_credentials, Post_url, myfilename)
 subprocess.call(cmd, shell=True)
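
The snippet above posts a single file, `myfilename`. A minimal sketch of how the dumped files might be collected first, assuming the `dumpData/` directory from the dump step and assuming that XML files can be recognized by a `.xml` extension (the dump tool's file naming may differ). The helper name `find_files` is hypothetical.

```python
import os


def find_files(root, extension):
    """Return the sorted paths of all files under `root`, including
    subdirectories, whose names end with `extension` (case-insensitive)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extension):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Each path found here could be used as `myfilename` in the curl call above.
    for myfilename in find_files("dumpData", ".xml"):
        print(myfilename)
```

Because `os.walk` descends into subdirectories, the same loop also covers the nested output that the question describes for commoncrawldump.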

Use the Bluemix Doc-Conv service to convert the rest to JSON:

 doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
 cmd = '''curl -X POST -u %s -F config="{\\"conversion_target\\":\\"answer_units\\"}" -F file=@%s %s''' % (doc_conv_credentials, myfilename, doc_conv_url)
 process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

Then save these JSON results in a JSON file.
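
A minimal sketch of that saving step, assuming `process` is a `subprocess.Popen` object created with `stdout=subprocess.PIPE` as in the snippet above. The helper name `save_output` and the output file name are hypothetical; the sketch also checks that the response parses as JSON before writing it, which the original answer does not mention.

```python
import json
import os
import subprocess
import sys
import tempfile


def save_output(process, out_path):
    """Wait for `process`, parse its stdout as JSON, and write it to `out_path`."""
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        raise RuntimeError("command failed: %s" % stderr.decode("utf-8", "replace"))
    data = json.loads(stdout.decode("utf-8"))  # raises ValueError on invalid JSON
    with open(out_path, "w") as handle:
        json.dump(data, handle)
    return data


if __name__ == "__main__":
    # Stand-in for the curl call: a child process that prints a small JSON document.
    demo = subprocess.Popen(
        [sys.executable, "-c", "print('{\"answer_units\": []}')"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out_path = os.path.join(tempfile.mkdtemp(), "converted.json")
    print(save_output(demo, out_path))
```

The resulting path plays the role of `Path_jsonFile` in the next step.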

Post this JSON file to the collection:

 Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' % (solr_cluster_id, solr_collection_name)
 cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' % (Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
 subprocess.call(cmd, shell=True)
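
The query string in that URL is easier to read when built from its parts. In Solr's `/update/json/docs` handler, `split` names the path at which the input JSON is split into documents, and each `f` parameter maps a Solr field to a path in the JSON (`field:/json/path`). A minimal sketch, assuming Python 3; the helper name `build_post_url` is hypothetical and the parameter values are copied from the snippet above.

```python
from urllib.parse import urlencode

BASE_URL = ("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/"
            "solr_clusters/%s/solr/%s/update/json/docs")


def build_post_url(solr_cluster_id, solr_collection_name):
    """Return the update URL with the commit, split, and field-mapping parameters."""
    params = [
        ("commit", "true"),                      # commit right after indexing
        ("split", "/answer_units/id"),           # split path, as in the answer
        ("f", "id:/answer_units/id"),            # Solr field "id"
        ("f", "title:/answer_units/title"),      # Solr field "title"
        ("f", "body:/answer_units/content/text"),  # Solr field "body"
    ]
    # Keep "/" and ":" unescaped so the result matches the hand-written URL.
    query = urlencode(params, safe="/:")
    return BASE_URL % (solr_cluster_id, solr_collection_name) + "?" + query


if __name__ == "__main__":
    print(build_post_url("CLUSTER-ID", "example_collection"))
```

Note that the original snippet wraps the URL in double quotes because it is interpolated into a shell command; that quoting is still needed if the result is used the same way, since `&` is special to the shell.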

Send a query:

 pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
 results = pysolr_client.search(Query_term)
 print(results.docs)

The code is Python. For beginners: you can use the curl commands directly in CMD. I hope it helps.

2 answers

Answer 1: 0, 2016-06-16 20:54:25

Answer 2: 0, accepted, 2016-07-19 02:05:36
