
Integration of Apache Nutch 1.12 and Solr 5.4.1 failed

I've successfully crawled several websites and created two segments using Nutch. I've also installed and started the Solr service.
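For context, this is roughly the standard Nutch 1.x sequence that produces the crawldb, linkdb and segment directories used below (the urls/ seed directory is just an assumption for illustration; the segment name is taken from the output further down):

bin/nutch inject crawl/crawldb urls/                               # seed the crawl database
bin/nutch generate crawl/crawldb crawl/segments                    # create a new segment
bin/nutch fetch crawl/segments/20161214143435                      # fetch pages for that segment
bin/nutch parse crawl/segments/20161214143435                      # parse the fetched content
bin/nutch updatedb crawl/crawldb crawl/segments/20161214143435     # merge parse results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments             # build the link database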

But when I try to index the crawled data into Solr, it doesn't work.

I tried this command:

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*

Output:

The input path at crawldb is not a segment... skipping
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:55:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexer: java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
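In case it helps with reading this trace: the "No FileSystem for scheme: http" error, together with "The input path at crawldb is not a segment... skipping", suggests the Solr URL was picked up as an input path. As far as I can tell, the Nutch 1.12 index command takes the Solr URL through the solr.server.url property rather than as the first positional argument, roughly like this (the core name nutch is only an assumption):

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*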

I also tried this command:

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*

Output:

Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:54:07
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Before this, I had copied the nutch/conf/schema.xml file into /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf and renamed it to managed-schema, as suggested.
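In shell terms, that step plus creating a core backed by the modified configset would look roughly like this (the core name nutch and the NUTCH_HOME variable are placeholders, not part of the original setup):

cp $NUTCH_HOME/conf/schema.xml /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf/managed-schema   # copy and rename in one step
/Nutch/solr-5.4.1/bin/solr create -c nutch -d data_driven_schema_configs   # create a core that uses that configset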

What might I be doing wrong? Thanks in advance!

Edit

This is my Nutch log:

...........................
...........................
2016-12-15 10:15:48,355 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2016-12-15 10:15:48,355 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-12-15 10:15:48,355 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214143435
2016-12-15 10:15:48,378 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214144230
2016-12-15 10:15:49,120 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-15 10:15:49,122 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-15 10:15:49,180 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-15 10:15:49,181 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-15 10:15:49,406 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-12-15 10:15:50,930 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: content dest: content
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: title dest: title
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: host dest: host
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: segment dest: segment
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-12-15 10:15:51,243 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,243 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,384 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,384 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,414 WARN  mapred.LocalJobRunner - job_local1333791357_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
............................
.............................
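One note on reading that 404: in Solr 5.x the update handler lives under a specific core (/solr/<core>/update), so a request to /solr/update with no core name in the URL typically comes back as 404 Not Found. Pointing the indexer at a concrete core would look like this (again, the core name nutch is just a placeholder):

bin/nutch solrindex http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*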

The problem was a version incompatibility between Solr, Nutch and HBase. This article worked perfectly for me.
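If anyone wants to double-check which cores a running Solr instance actually exposes before re-running the indexer, the core admin API is a quick way to do it:

curl "http://localhost:8983/solr/admin/cores?action=STATUS"   # lists the cores the server knows about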
