
Integration of Apache Nutch 1.12 and Solr 5.4.1 failed

I've successfully crawled several websites and created two segments using Nutch. I've also installed and started the Solr service.
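For context, this is roughly the standard Nutch 1.x sequence that produces the crawldb, linkdb and segment directories used below (the urls/ seed directory is just an assumption for illustration; the segment name is taken from the output further down):

bin/nutch inject crawl/crawldb urls/                               # seed the crawl database
bin/nutch generate crawl/crawldb crawl/segments                    # create a new segment
bin/nutch fetch crawl/segments/20161214143435                      # fetch pages for that segment
bin/nutch parse crawl/segments/20161214143435                      # parse the fetched content
bin/nutch updatedb crawl/crawldb crawl/segments/20161214143435     # merge parse results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments             # build the link database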

But when I try to index the crawled data into Solr, it doesn't work.

I tried this command:

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*

Output:

The input path at crawldb is not a segment... skipping
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:55:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexer: java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
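In case it helps with reading this trace: the "No FileSystem for scheme: http" error, together with "The input path at crawldb is not a segment... skipping", suggests the Solr URL was picked up as an input path. As far as I can tell, the Nutch 1.12 index command takes the Solr URL through the solr.server.url property rather than as the first positional argument, roughly like this (the core name nutch is only an assumption):

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*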

I also tried this command:

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*

Output:

Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:54:07
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Before this, I had copied the nutch/conf/schema.xml file into /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf and renamed it to managed-schema, as suggested.
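In shell terms, that step plus creating a core backed by the modified configset would look roughly like this (the core name nutch and the NUTCH_HOME variable are placeholders, not part of the original setup):

cp $NUTCH_HOME/conf/schema.xml /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf/managed-schema   # copy and rename in one step
/Nutch/solr-5.4.1/bin/solr create -c nutch -d data_driven_schema_configs   # create a core that uses that configset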

What might I be doing wrong? Thanks in advance!

Edit

This is my Nutch log:

...........................
...........................
2016-12-15 10:15:48,355 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2016-12-15 10:15:48,355 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-12-15 10:15:48,355 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214143435
2016-12-15 10:15:48,378 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214144230
2016-12-15 10:15:49,120 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-15 10:15:49,122 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-15 10:15:49,180 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-15 10:15:49,181 WARN  conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-15 10:15:49,406 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-12-15 10:15:50,930 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: content dest: content
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: title dest: title
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: host dest: host
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: segment dest: segment
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-12-15 10:15:51,137 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-12-15 10:15:51,243 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,243 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,384 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,384 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,414 WARN  mapred.LocalJobRunner - job_local1333791357_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
............................
.............................
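One note on reading that 404: in Solr 5.x the update handler lives under a specific core (/solr/<core>/update), so a request to /solr/update with no core name in the URL typically comes back as 404 Not Found. Pointing the indexer at a concrete core would look like this (again, the core name nutch is just a placeholder):

bin/nutch solrindex http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*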

The problem was a version incompatibility between Solr, Nutch and HBase. This article worked perfectly for me.
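If anyone wants to double-check which cores a running Solr instance actually exposes before re-running the indexer, the core admin API is a quick way to do it:

curl "http://localhost:8983/solr/admin/cores?action=STATUS"   # lists the cores the server knows about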
