
Error using Spark's Kryo serializer with java protocol buffers that have arrays of strings

I am hitting a bug when using Java protocol buffer classes as the object model for RDDs in Spark jobs.

For my application, my .proto file has fields that are repeated strings. For example:

message OntologyHumanName {
  repeated string family = 1;
}

From this, the 2.5.0 protoc compiler generates Java code like

private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;

If I run a Scala Spark job that uses the Kryo serializer, I get the following error:

Caused by: java.lang.NullPointerException
at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:61)
at java.util.AbstractList.add(AbstractList.java:108)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 40 more

The same code works fine with spark.serializer=org.apache.spark.serializer.JavaSerializer.

My environment is CDH QuickStart 5.5 with JDK 1.8.0_60
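
For reference, Kryo is switched on in the job roughly like this (a minimal sketch of the configuration; the app name is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Enable Kryo instead of the default Java serialization.
val conf = new SparkConf()
  .setAppName("protobuf-kryo-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)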

Try registering the lazy class with:

Kryo kryo = new Kryo();
kryo.register(com.google.protobuf.LazyStringArrayList.class);
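
In a Spark job the Kryo instance is created by the framework, so one place to put that registration is a custom KryoRegistrator wired in through spark.kryo.registrator. A minimal sketch in Scala: the registrator name is made up, and the import of the generated OntologyHumanName class depends on your proto's java_package.

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Sketch of a registrator that registers the protobuf lazy-string classes
// plus the generated message class itself.
class ProtoKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[com.google.protobuf.LazyStringArrayList])
    kryo.register(classOf[com.google.protobuf.UnmodifiableLazyStringList])
    kryo.register(classOf[OntologyHumanName]) // import from your generated package
  }
}

// Wire the registrator into the job configuration.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ProtoKryoRegistrator].getName)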

Also, for custom protobuf messages, take a look at the solution in this answer for registering custom/nested classes generated by protoc.

I think your RDD's element type contains the class OntologyHumanName, e.g. RDD[(String, OntologyHumanName)], and this RDD happens to be serialized during a shuffle stage. See https://github.com/EsotericSoftware/kryo#kryoserializable: Kryo can't serialize an abstract class.

  1. Read the spark doc: http://spark.apache.org/docs/latest/tuning.html#data-serialization

     val conf = new SparkConf().setMaster(...).setAppName(...)
     conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
     val sc = new SparkContext(conf)
  2. From the Kryo doc:

     public class SomeClass implements KryoSerializable {
         // ...
         public void write (Kryo kryo, Output output) { /* ... */ }
         public void read (Kryo kryo, Input input) { /* ... */ }
     }

But the class OntologyHumanName is generated automatically by protobuf, so I don't think this is a good way to go.

  1. Try wrapping OntologyHumanName in a case class, to avoid serializing OntologyHumanName directly. I didn't try this way; it possibly doesn't work:

     case class OntologyHumanNameScalaCaseClass(val humanNames: OntologyHumanName)
  2. An ugly way: I just converted the protobuf class to plain Scala types, which cannot fail. For example (a fuller sketch follows this list):

     import scala.collection.JavaConverters._

     val humanNameObj: OntologyHumanName = ...
     val families: List[String] = humanNameObj.getFamilyList.asScala.toList // use this instead of humanNameObj
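
A fuller sketch of the same idea, mapping the whole pair RDD to a plain Scala case class before any shuffle. This assumes an RDD[(String, OntologyHumanName)] as above; HumanNames and toScala are just illustrative names:

import scala.collection.JavaConverters._
import org.apache.spark.rdd.RDD

// Plain Scala shape that Kryo serializes without trouble.
case class HumanNames(families: List[String])

// Convert the protobuf values up front, so later shuffles
// (groupByKey, reduceByKey, ...) never see the generated class.
def toScala(rdd: RDD[(String, OntologyHumanName)]): RDD[(String, HumanNames)] =
  rdd.mapValues(proto => HumanNames(proto.getFamilyList.asScala.toList))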

Hope this resolves your problem.
