Exception while connecting to MongoDB in Spark

I get "java.lang.IllegalStateException: not ready" in org.bson.BasicBSONDecoder._decode while trying to use MongoDB as an input RDD:

// Assumes a JavaSparkContext called 'sc' has already been created.
// Needed imports: org.apache.hadoop.conf.Configuration,
// org.apache.spark.api.java.JavaPairRDD, org.bson.BSONObject,
// com.mongodb.hadoop.MongoInputFormat
Configuration conf = new Configuration();
conf.set("mongo.input.uri", "mongodb://127.0.0.1:27017/test.input");

JavaPairRDD<Object, BSONObject> rdd = sc.newAPIHadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class);

System.out.println(rdd.count());

The exception I get is:

14/08/06 09:49:57 INFO rdd.NewHadoopRDD: Input split: MongoInputSplit{URI=mongodb://127.0.0.1:27017/test.input, authURI=null, min={ "_id" : { "$oid" : "53df98d7e4b0a67992b31f8d"}}, max={ "_id" : { "$oid" : "53df98d7e4b0a67992b331b8"}}, query={ }, sort={ }, fields={ }, notimeout=false}
14/08/06 09:49:57 WARN scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException
java.lang.IllegalStateException: not ready
            at org.bson.BasicBSONDecoder._decode(BasicBSONDecoder.java:139)
            at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:123)
            at com.mongodb.hadoop.input.MongoInputSplit.readFields(MongoInputSplit.java:185)
            at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
            at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
            at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:88)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
            at java.lang.reflect.Method.invoke(Method.java:618)
            at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1089)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1962)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1867)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1419)
            at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2059)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1984)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1867)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1419)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:420)
            at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:147)
            at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1906)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1865)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1419)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:420)
            at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
            at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
            at java.lang.Thread.run(Thread.java:804)

All the program output is here.

Environment:

  • Redhat
  • Spark 1.0.1
  • Hadoop 2.4.1
  • MongoDB 2.4.10
  • mongo-hadoop-1.3

I think I've found the issue: mongodb-hadoop has a "static" modifier on its BSON encoder/decoder instances in core/src/main/java/com/mongodb/hadoop/input/MongoInputSplit.java. When Spark runs in multithreaded mode, all the threads try to deserialise using the same encoder/decoder instances, which predictably has bad results.
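The gist of the change (shown here only as an illustrative sketch, not the actual mongo-hadoop source) is to give each split object its own decoder instance instead of a class-wide static one, since BasicBSONDecoder keeps internal state between calls:

import org.bson.BSONDecoder;
import org.bson.BSONObject;
import org.bson.BasicBSONDecoder;

// Illustrative sketch only, not the actual MongoInputSplit code.
public class SplitDeserializer {
    // Shared, stateful decoder: concurrent readFields() calls from several
    // task threads interleave its internal state and trigger "not ready".
    // private static final BSONDecoder DECODER = new BasicBSONDecoder();

    // Per-instance decoder: each split deserialises with its own decoder,
    // so multithreaded executors no longer race on shared state.
    private final BSONDecoder decoder = new BasicBSONDecoder();

    public BSONObject deserialize(byte[] serializedSplit) {
        return decoder.readObject(serializedSplit);
    }
}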

The patch is on my GitHub here (I have submitted a pull request upstream).

I'm now able to run an 8-core multithreaded Spark->mongo collection count() from Python!

I found the same problem. As a workaround I abandoned the newAPIHadoopRDD approach and implemented a parallel load mechanism based on defining intervals on the document id and then loading each partition in parallel. The idea is to implement the following mongo shell code using the MongoDB Java driver:

// Compute min and max id of the collection
db.coll.find({},{_id:1}).sort({_id: 1}).limit(1)
   .forEach(function(doc) {min_id = doc._id})
db.coll.find({},{_id:1}).sort({_id: -1}).limit(1)
   .forEach(function(doc) {max_id = doc._id})

// Compute id ranges
curr_id = min_id
ranges = []
page_size = 1000
// to avoid the use of Comparable in the Java translation
while(! curr_id.equals(max_id)) {
    prev_id = curr_id    
    db.coll.find({_id : {$gte : curr_id}}, {_id : 1})
           .sort({_id: 1})
           .limit(page_size + 1)
           .forEach(function(doc) {
                       curr_id = doc._id
                   })
    ranges.push([prev_id, curr_id])
}

Now we can use the ranges to perform fast queries for collection fragments. Note that the last fragment needs to be treated differently, with just a min constraint, to avoid losing the last document of the collection.

db.coll.find({_id : {$gte : ranges[1][0], $lt : ranges[1][1]}})
db.coll.find({_id : {$gte : ranges[2][0]}})

I implemented this as a Java method 'LinkedList<Range> computeIdRanges(DBCollection coll, int rangeSize)' for a simple Range POJO, and then I parallelize the list of ranges and transform it with flatMapToPair to generate an RDD similar to the one returned by newAPIHadoopRDD.
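A rough sketch of what such a Range POJO and computeIdRanges method could look like (my own approximation, assuming the 2.x DBCollection/DBCursor driver API and Guava's serializable Optional; the names mirror the snippet below but the details are guesses):

import java.util.LinkedList;

import com.google.common.base.Optional;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;

// Approximate sketch, not the answer's actual code.
public class MongoDBLoader {

    public static class Range implements java.io.Serializable {
        public final Object min;            // inclusive lower bound on _id
        public final Optional<Object> max;  // exclusive upper bound; absent for the last fragment
        public Range(Object min, Optional<Object> max) { this.min = min; this.max = max; }
    }

    public static LinkedList<Range> computeIdRanges(DBCollection coll, int rangeSize) {
        BasicDBObject idOnly = new BasicDBObject("_id", 1);

        // min and max _id of the collection, as in the shell version above
        Object minId = coll.find(new BasicDBObject(), idOnly)
                .sort(new BasicDBObject("_id", 1)).limit(1).next().get("_id");
        Object maxId = coll.find(new BasicDBObject(), idOnly)
                .sort(new BasicDBObject("_id", -1)).limit(1).next().get("_id");

        LinkedList<Range> ranges = new LinkedList<Range>();
        Object currId = minId;
        while (!currId.equals(maxId)) {
            Object prevId = currId;
            // fetch rangeSize + 1 ids starting at currId (inclusive); the last
            // id visited becomes this range's upper bound and the next range's
            // lower bound, exactly as in the shell loop above
            DBCursor cursor = coll.find(
                    new BasicDBObject("_id", new BasicDBObject("$gte", currId)), idOnly)
                    .sort(new BasicDBObject("_id", 1))
                    .limit(rangeSize + 1);
            while (cursor.hasNext()) {
                currId = cursor.next().get("_id");
            }
            ranges.add(new Range(prevId, Optional.<Object>of(currId)));
        }

        // the last fragment keeps only a min constraint, so the final
        // document of the collection is not lost
        if (!ranges.isEmpty()) {
            Range last = ranges.removeLast();
            ranges.add(new Range(last.min, Optional.<Object>absent()));
        }
        return ranges;
    }
}

The returned ranges then feed the parallelize/flatMapToPair code below: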

List<Range> ranges = computeIdRanges(coll, DEFAULT_RANGE_SIZE);
JavaRDD<Range> parallelRanges = sparkContext.parallelize(ranges, ranges.size());
JavaPairRDD<Object, BSONObject> mongoRDD = 
   parallelRanges.flatMapToPair(
     new PairFlatMapFunction<MongoDBLoader.Range, Object, BSONObject>() {
       ...
       BasicDBObject query = range.max.isPresent() ?
           new BasicDBObject("_id", new BasicDBObject("$gte", range.min)
                            .append("$lt", range.max.get()))
         : new BasicDBObject("_id", new BasicDBObject("$gte", range.min));
       ...

You can play with the size of the ranges and the number of slices used to parallelize, to control the granularity of parallelism.

I hope that helps,

Greetings!

Juan Rodríguez Hortalá

I had the same combination of exceptions after importing a BSON file using mongorestore. Calling db.collection.reIndex() solved the problem for me.
