Integration of Apache Nutch 1.12 and Solr 5.4.1 failed
I've successfully crawled several websites and created two segments using Nutch. I've also installed and started the Solr service.
But when I try to index the crawled data into Solr, it doesn't work.
I tried this command:
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*
Output:
The input path at crawldb is not a segment... skipping
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:55:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
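A note on this first trace: `No FileSystem for scheme: http` suggests that Hadoop was asked to open the Solr URL as an input path. That is consistent with the `index` command reading its first positional argument as the crawldb, which would also explain the `The input path at crawldb is not a segment... skipping` line. In Nutch 1.x the `solrindex` alias forwards its first argument as the `solr.server.url` property, whereas `index` expects that property to be set explicitly. The sketch below assembles such a command; the helper name `build_index_cmd` is hypothetical, and the URL and paths are simply those from the question.

```shell
#!/bin/sh
# Sketch only: build a `bin/nutch index` command line that passes the Solr URL
# as the solr.server.url property instead of as a positional argument.
# build_index_cmd is a hypothetical helper, not part of Nutch.

build_index_cmd() {
  solr_url=$1    # e.g. http://localhost:8983/solr (taken from the question)
  crawl_dir=$2   # directory holding crawldb/, linkdb/ and segments/
  # The trailing * is printed literally here; the shell expands it only
  # when the resulting command is actually run.
  printf '%s' "bin/nutch index -D solr.server.url=${solr_url} ${crawl_dir}/crawldb/ -linkdb ${crawl_dir}/linkdb/ ${crawl_dir}/segments/*"
}

# Print the command so it can be inspected before being run by hand.
build_index_cmd "http://localhost:8983/solr" "crawl"
echo
```

Running the printed command should get past the scheme error, though it would presumably then behave like the `solrindex` run that follows.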
I also tried this command:
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*
Output:
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:54:07
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
Before running these commands, I copied the nutch/conf/schema.xml file into /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf and renamed it to managed-schema, as suggested.
What might I be doing wrong? Thanks in advance!
Edit
This is my Nutch log:
...........................
...........................
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214143435
2016-12-15 10:15:48,378 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214144230
2016-12-15 10:15:49,120 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-15 10:15:49,122 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-15 10:15:49,180 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-15 10:15:49,181 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-15 10:15:49,406 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-12-15 10:15:50,930 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: content dest: content
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: title dest: title
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: host dest: host
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: segment dest: segment
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: boost dest: boost
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: digest dest: digest
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,414 WARN mapred.LocalJobRunner - job_local1333791357_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
............................
.............................
The problem was a version incompatibility between Solr, Nutch, and HBase. This article worked perfectly for me.
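Independently of the version mismatch, the 404 in the Nutch log (`Problem accessing /solr/update`) is typically what Solr 5 returns when the URL stops at `/solr` without naming a core, since the update handler lives under `/solr/<core>/update`. A minimal sketch for building and checking a core-qualified URL follows; the core name `nutch` and both helper names are assumptions, not something the question states.

```shell
#!/bin/sh
# Sketch only: derive a core-qualified Solr URL and, optionally, probe it.
# The core name "nutch" is hypothetical; use whatever core actually exists.

core_url() {
  base=${1%/}                 # drop one trailing slash, if present
  printf '%s/%s' "$base" "$2"
}

probe_select() {
  # Prints the HTTP status of a zero-row query against the core.
  # Needs curl and a running Solr; not called automatically below.
  curl -s -o /dev/null -w '%{http_code}' "$1/select?q=*:*&rows=0"
}

url=$(core_url "http://localhost:8983/solr/" "nutch")
echo "$url"
```

With Solr running, `probe_select "$url"` prints the HTTP status; a 404 would suggest that no core exists under that name. The core-qualified URL would then replace `http://localhost:8983/solr` in the `solrindex` command.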