Pass data from Nutch to Solr

I'm trying to pass the data crawled by the Nutch web crawler to the Solr search and indexing platform using the following command:

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/ -dir crawl/segments/20161124145935/ crawl/segments/20161124150145/ -filter -normalize
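For reference, the arguments the indexer expects in Nutch 1.x are roughly the following (based on the IndexingJob help output; run bin/nutch index with no arguments to see the exact usage for your version):

Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]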

But I'm getting the following error:

The input path at segments is not a segment... skipping
The input path at content is not a segment... skipping
The input path at crawl_fetch is not a segment... skipping
Skipping segment: file:/Users/cell/Desktop/usi/information-retrieval/project/apache-nutch-1.12/crawl/segments/20161124145935/crawl_generate. Missing sub directories: parse_data, parse_text, crawl_parse, crawl_fetch
The input path at crawl_parse is not a segment... skipping
The input path at parse_data is not a segment... skipping
The input path at parse_text is not a segment... skipping
Segment dir is complete: crawl/segments/20161124150145.
Indexer: starting at 2016-11-25 05:02:17
Indexer: deleting gone documents: false
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Here's the log from Nutch:

2016-11-25 06:05:03,378 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-11-25 06:05:03,500 WARN  segment.SegmentChecker - The input path at segments is not a segment... skipping
2016-11-25 06:05:03,506 WARN  segment.SegmentChecker - The input path at content is not a segment... skipping
2016-11-25 06:05:03,506 WARN  segment.SegmentChecker - The input path at crawl_fetch is not a segment... skipping
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - Skipping segment: file:/Users/cell/Desktop/usi/information-retrieval/project/apache-nutch-1.12/crawl/segments/20161124145935/crawl_generate. Missing sub directories: parse_data, parse_text, crawl_parse, crawl_fetch
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - The input path at crawl_parse is not a segment... skipping
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - The input path at parse_data is not a segment... skipping
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - The input path at parse_text is not a segment... skipping
2016-11-25 06:05:03,509 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20161124150145.
2016-11-25 06:05:03,510 INFO  indexer.IndexingJob - Indexer: starting at 2016-11-25 06:05:03
2016-11-25 06:05:03,512 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
2016-11-25 06:05:03,512 INFO  indexer.IndexingJob - Indexer: URL filtering: true
2016-11-25 06:05:03,512 INFO  indexer.IndexingJob - Indexer: URL normalizing: true
2016-11-25 06:05:03,614 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-11-25 06:05:03,615 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


2016-11-25 06:05:03,616 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2016-11-25 06:05:03,616 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-11-25 06:05:03,617 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161124150145
2016-11-25 06:05:04,006 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/staging/cell1463380038/.staging/job_local1463380038_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-11-25 06:05:04,010 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/staging/cell1463380038/.staging/job_local1463380038_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-11-25 06:05:04,088 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/local/localRunner/cell/job_local1463380038_0001/job_local1463380038_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-11-25 06:05:04,090 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/local/localRunner/cell/job_local1463380038_0001/job_local1463380038_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-11-25 06:05:04,258 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-11-25 06:05:04,272 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:08,950 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:09,344 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:09,734 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:10,908 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:11,376 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:11,686 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: content dest: content
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: title dest: title
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: host dest: host
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: segment dest: segment
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-11-25 06:05:11,940 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-11-25 06:05:11,940 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-11-25 06:05:12,139 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-11-25 06:05:12,139 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-11-25 06:05:12,207 WARN  mapred.LocalJobRunner - job_local1463380038_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:543)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:173)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:367)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-11-25 06:05:12,293 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

I've not created any cores or collections from the UI, and honestly I'm not sure exactly what this command that passes data to Solr actually does...

Since I'm very new to both Nutch and Solr, this is difficult to debug...

The log shows the error: since you didn't create any core/collection, the SolrJ library is complaining that it can't find the /solr/update handler, which causes the indexing step to fail. Just create a core/collection and update the Solr URL that you pass to the bin/crawl script, as in the sketch below. Then follow the steps in https://wiki.apache.org/nutch/NutchTutorial to do your first crawl.
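A minimal sketch of the fix, assuming a standalone (non-SolrCloud) Solr 5+ install and a core named nutch (the core name is just an example; use whatever name you like):

# Create a core for Nutch to index into ("nutch" is a placeholder name)
bin/solr create -c nutch

# Re-run the indexing step with the core name appended to the Solr URL.
# Only segment 20161124150145 is listed, since your log shows it is the
# only complete one; alternatively, pass the parent directory with
# "-dir crawl/segments/" instead of individual segment paths.
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
    crawl/crawldb/ -linkdb crawl/linkdb/ \
    crawl/segments/20161124150145/ -filter -normalize

With the core name in the URL, SolrJ posts documents to /solr/nutch/update instead of the non-existent /solr/update, which is exactly the 404 shown in your log. The "is not a segment... skipping" warnings, by contrast, come from mixing -dir (which expects the parent segments directory) with individual segment paths in your original command.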

Follow this link. I was facing the same problem as you. This step-by-step process will definitely work.
