RuntimeException when running nutch generate

I'm new to Nutch. I have installed Nutch 2.3.1 and configured it to use MongoDB. The inject operation succeeded, but the generate step throws an exception (see below). Note: the error occurs with a seed file containing 60K URLs. When I tried with 100 URLs, everything went well.

Do you have any idea what causes this error? Thanks!

2016-12-30 00:01:48,446 INFO  crawl.GeneratorJob - GeneratorJob: starting at 2016-12-30 00:01:48
2016-12-30 00:01:48,447 INFO  crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2016-12-30 00:01:48,447 INFO  crawl.GeneratorJob - GeneratorJob: starting
2016-12-30 00:01:48,448 INFO  crawl.GeneratorJob - GeneratorJob: filtering: true
2016-12-30 00:01:48,448 INFO  crawl.GeneratorJob - GeneratorJob: normalizing: true
2016-12-30 00:01:48,448 INFO  crawl.GeneratorJob - GeneratorJob: topN: 100000
2016-12-30 00:01:48,816 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-30 00:01:48,857 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-30 00:01:48,867 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-30 00:01:48,867 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-30 00:01:51,568 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-30 00:01:51,573 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/staging/mehdi1740651658/.staging/job_local1740651658_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-30 00:01:51,753 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-12-30 00:01:51,760 WARN  conf.Configuration - file:/tmp/hadoop-mehdi/mapred/local/localRunner/mehdi/job_local1740651658_0001/job_local1740651658_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-12-30 00:01:52,408 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-30 00:01:52,408 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-30 00:01:52,408 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-30 00:01:52,591 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2016-12-30 00:02:03,229 ERROR mapreduce.GoraRecordReader - Error reading Gora records: Read operation to server localhost:27017 failed on database nutch
2016-12-30 00:02:04,607 WARN  mapred.LocalJobRunner - job_local1740651658_0001
java.lang.Exception: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:122)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.mongodb.MongoException$Network: Read operation to server localhost:27017 failed on database nutch
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:298)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:269)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:235)
    at com.mongodb.QueryResultIterator.getMore(QueryResultIterator.java:145)
    at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:135)
    at com.mongodb.DBCursor._hasNext(DBCursor.java:626)
    at com.mongodb.DBCursor.hasNext(DBCursor.java:657)
    at org.apache.gora.mongodb.query.MongoDBResult.nextInner(MongoDBResult.java:71)
    at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:111)
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:118)
    ... 12 more
Caused by: java.io.EOFException
    at org.bson.io.Bits.readFully(Bits.java:75)
    at org.bson.io.Bits.readFully(Bits.java:50)
    at org.bson.io.Bits.readFully(Bits.java:37)
    at com.mongodb.Response.<init>(Response.java:42)
    at com.mongodb.DBPort$1.execute(DBPort.java:164)
    at com.mongodb.DBPort$1.execute(DBPort.java:158)
    at com.mongodb.DBPort.doOperation(DBPort.java:187)
    at com.mongodb.DBPort.call(DBPort.java:158)
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:290)
    ... 21 more
2016-12-30 00:02:04,846 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=nutch-maven-1.0-SNAPSHOT.jar, jobid=job_local1740651658_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:256)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:322)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:330)

I figured out that the problem comes from the MongoDB version. Nutch uses mongo-java-driver-2.13.1.jar and I had installed MongoDB 3.4.1. So I installed MongoDB 2.6.7 and now it works fine. I'll try to update the driver in Nutch and tell you if it works with the newer version of MongoDB.
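
For anyone who wants to spot this kind of mismatch before running a crawl, here is a minimal Python sketch. It only compares the major version in the driver jar name with the major version of the server, which matches the two cases I saw (3.4.1 failed, 2.6.7 worked). This is a rough heuristic of my own, not MongoDB's official compatibility matrix, and the function names `parse_version` and `server_newer_than_driver` are hypothetical helpers, not part of Nutch, Gora, or the MongoDB driver.

```python
import re


def parse_version(text):
    """Return the first 'major.minor.patch' found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def server_newer_than_driver(driver_jar, server_version):
    """Heuristic (an assumption, not an official rule): flag a likely
    mismatch when the server's major version exceeds the driver's."""
    driver = parse_version(driver_jar)
    server = parse_version(server_version)
    return server[0] > driver[0]


# The jar name and the two server versions come from the post above.
jar = "mongo-java-driver-2.13.1.jar"
for server in ("3.4.1", "2.6.7"):
    if server_newer_than_driver(jar, server):
        status = "server is a newer major version than the driver"
    else:
        status = "same major version"
    print(f"{jar} vs MongoDB {server}: {status}")
```

You still have to look up the jar name and the server version yourself; the sketch just makes the comparison explicit. For a definitive answer, check the compatibility table in the MongoDB Java driver documentation.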
