Indexing Nutch crawled data in "Bluemix" Solr

Question

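I am trying to index Nutch-crawled data with Bluemix Solr, but I cannot find a way to do it. My main question is: can anyone help me with this? What should I do to send the results of my Nutch crawl to my Bluemix Solr? For crawling I used Nutch 1.11. Below is part of what I have done so far and the problems I ran into. I think there are two possible solutions:

By nutch command:

NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/ -Dsolr.server.url="OURSOLRURL"

With this command I can index the Nutch-crawled data with OURSOLR. However, I found some problems with it.

a- Although it sounds strange, it would not accept the URL as given. I could work around that by URL-encoding it.

b- Since I have to connect with a specific username and password, Nutch could not connect to my Solr. Considering this: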

 Active IndexWriters :
 SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

from the command-line output, I tried to handle this by adding the authentication parameters to the command: solr.auth=true solr.auth.username="SOLR-UserName" solr.auth.password="Pass".

So up to now I have arrived at this command:

bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"

But for some reason I have not figured out yet, the command treats the authentication parameters as crawled data directories and does not work. So I guess this is not the right way to set the "Active IndexWriters" parameters. Can anyone tell me how to do it?

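By curl command:

curl -X POST -H "Content-Type: application/json" -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" --data-binary @{/path_to_file}/FILE.json

I thought maybe I could feed it the json files created by this command:

bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename -jsonArray -reverseKey

But there are some problems here.

a. This command produces so many files in complicated paths that it would take a lot of time to post all of them manually. I guess for big crawls it might even be impossible. Is there any way to POST all the files in a directory and its subdirectories at once with just one command?

b. There is a weird name "ÙÙ÷yœ" at the start of the json files created by commoncrawldump.

c. I removed the weird name and tried to POST just one of these files, but here is the result: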

 {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown command 'url' at [9]","code":400}}

Does this mean that these files cannot be fed to Bluemix Solr and are all useless to me?

Answer 1

Thanks to Lewis John Mcgibbney I realized that the index tool should be used as follows:

bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb Crawl/segments/2016*

Meaning: use -D before each of the parameters, including the auth ones, and put these parameters before the tool's own arguments (crawldb, linkdb and segments).
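
The ordering rule can be sketched in Python as a small helper that assembles the argument list. This is only an illustration: the function name `build_index_command` is hypothetical, and the URL, user name and password are the same placeholders used in the command above.

```python
import glob


def build_index_command(properties, crawldb, linkdb, segments):
    """Return the argv list for `bin/nutch index`, with every -D option
    placed before the positional tool arguments."""
    cmd = ["bin/nutch", "index"]
    for key, value in properties.items():
        # Each property gets its own -D flag.
        cmd += ["-D", f"{key}={value}"]
    # Positional arguments come last: crawldb, linkdb, then the segments.
    cmd += [crawldb, "-linkdb", linkdb] + list(segments)
    return cmd


props = {
    "solr.server.url": "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections",
    "solr.auth": "true",
    "solr.auth.username": "USERNAME",  # placeholder
    "solr.auth.password": "PASS",      # placeholder
}
# Expand the segment pattern in Python, since no shell is involved here.
segments = sorted(glob.glob("Crawl/segments/2016*"))
command = build_index_command(props, "Crawl/crawldb", "Crawl/linkdb", segments)
print(" ".join(command))
```

Passing such a list to `subprocess.run(command)` without `shell=True` means the URL needs no shell quoting.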

Answer 2

To index Nutch-crawled data in the Bluemix Retrieve and Rank service, one should:

Crawl the seeds with Nutch, e.g.

$ bin/crawl -w 5 urls crawl 25

You can check the status of the crawl with:

bin/nutch readdb crawl/crawldb/ -stats

Dump the crawled data as files:

$ bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/

Post the files that can be posted directly, e.g. xml files, to the Solr collection on Retrieve and Rank:

 import subprocess
 Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' %(solr_cluster_id, solr_collection_name)
 cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_xml, solr_credentials, Post_url, myfilename)
 subprocess.call(cmd, shell=True)
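
Since the dump can contain many files, posting them one by one by hand is impractical. A minimal sketch of walking the dump directory and building one curl command per file could look like this. The helper names `find_files` and `build_post_command`, the `application/xml` content type, and the `USERNAME:PASS` credentials are assumptions for illustration; the URL uses the same placeholders as above.

```python
import os


def find_files(root, extensions):
    """Yield paths under `root` (recursively) whose extension is in `extensions`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in extensions:
                yield os.path.join(dirpath, name)


def build_post_command(update_url, credentials, content_type, path):
    """Return a curl argv list that POSTs one file to the Solr update URL."""
    return [
        "curl", "-X", "POST",
        "-H", f"Content-Type: {content_type}",
        "-u", credentials,
        update_url,
        "--data-binary", f"@{path}",
    ]


update_url = ("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/"
              "solr_clusters/CLUSTER-ID/solr/example_collection/update")
for path in find_files("dumpData", {".xml"}):
    cmd = build_post_command(update_url, "USERNAME:PASS", "application/xml", path)
    # Printed only; pass `cmd` to subprocess.call to actually send the file.
    print(" ".join(cmd))
```

The same loop with a different extension set can select the files that need conversion first.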

Convert the rest to json using the Bluemix Doc-Conv service:

 doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
 cmd = '''curl -X POST -u %s -F config="{\\"conversion_target\\":\\"answer_units\\"}" -F file=@%s %s''' %(doc_conv_credentials, myfilename, doc_conv_url)
 process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
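
The nested quoting in that command is easy to get wrong. One possible alternative, sketched here under the assumption that curl is called without a shell, is to build the config with `json.dumps` and pass the arguments as a list. The function name `build_convert_command` is hypothetical and the credentials are placeholders.

```python
import json


def build_convert_command(credentials, path, url):
    """Return a curl argv list for a Document Conversion request."""
    # json.dumps produces the config string, so no manual escaping is needed.
    config = json.dumps({"conversion_target": "answer_units"})
    return [
        "curl", "-X", "POST",
        "-u", credentials,
        "-F", f"config={config}",
        "-F", f"file=@{path}",
        url,
    ]


doc_conv_url = ("https://gateway.watsonplatform.net/document-conversion/"
                "api/v1/convert_document?version=2015-12-15")
cmd = build_convert_command("USERNAME:PASS", "page.html", doc_conv_url)
print(cmd)
```

The list can then be run with `subprocess.run(cmd, capture_output=True)` and the JSON read from the result's `stdout`.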

Then save these JSON results in a json file.

Post this json file to the collection:

 Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' %(solr_cluster_id, solr_collection_name)
 cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
 subprocess.call(cmd, shell=True)
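
Note that this request goes to `/update/json/docs` rather than plain `/update`. As far as I understand Solr, plain `/update` reads the top-level keys of a JSON object as update commands, which would likely explain the "Unknown command 'url'" error in the question, while `/update/json/docs` accepts arbitrary JSON and maps it to fields through the `split` and `f` parameters. The sketch below assembles that query string from a list of pairs; the function name `build_json_docs_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_json_docs_url(cluster_id, collection):
    """Build the /update/json/docs URL with the field mappings used above."""
    base = ("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/"
            f"solr_clusters/{cluster_id}/solr/{collection}/update/json/docs")
    params = [
        ("commit", "true"),
        ("split", "/answer_units/id"),            # where to split the JSON into documents
        ("f", "id:/answer_units/id"),             # Solr field : path in the JSON
        ("f", "title:/answer_units/title"),
        ("f", "body:/answer_units/content/text"),
    ]
    # Keep "/" and ":" unescaped so the result matches the hand-written URL.
    return base + "?" + urlencode(params, safe="/:")


print(build_json_docs_url("CLUSTER-ID", "example_collection"))
```

Keeping the mappings in a list makes it easier to add or change fields without editing one long string.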

Send queries:

 pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
 results = pysolr_client.search(Query_term)
 print(results.docs)

The code is in Python. For beginners: you can use the curl commands directly in CMD. I hope it helps.

2 answers

Answer 1
0 2016-06-16 20:54:25

Answer 2
0 Accepted 2016-07-19 02:05:36
