I am hitting a bug when using Java protocol buffer classes as the object model for RDDs in Spark jobs.
For my application, my .proto file has fields of type repeated string. For example:
message OntologyHumanName
{
repeated string family = 1;
}
From this, the protoc 2.5.0 compiler generates Java code like:
private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;
If I run a Scala Spark job that uses the Kryo serializer, I get the following error:
Caused by: java.lang.NullPointerException
at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:61)
at java.util.AbstractList.add(AbstractList.java:108)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 40 more
The same code works fine with spark.serializer=org.apache.spark.serializer.JavaSerializer.
My environment is CDH QuickStart 5.5 with JDK 1.8.0_60.
Try registering the LazyStringArrayList class with Kryo:

Kryo kryo = new Kryo();
kryo.register(com.google.protobuf.LazyStringArrayList.class);

Also, for custom protobuf messages, take a look at the solution in this answer for registering custom/nested classes generated by protoc.
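In Spark you normally don't instantiate Kryo yourself; the registration is wired in through the job configuration instead. A minimal configuration sketch under the assumption that protobuf-java 2.5.0 and the generated OntologyHumanName class are on the classpath (registerKryoClasses and spark.serializer are standard Spark API):

```scala
import org.apache.spark.SparkConf

// Register the protobuf lazy-string list classes (and the generated message
// class) with Kryo via the Spark configuration.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[com.google.protobuf.LazyStringArrayList],
    classOf[com.google.protobuf.UnmodifiableLazyStringList],
    classOf[OntologyHumanName]
  ))
```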
I think your RDD's element type contains the class OntologyHumanName, e.g. RDD[(String, OntologyHumanName)], and an RDD of this type happens to get serialized during a shuffle stage. See https://github.com/EsotericSoftware/kryo#kryoserializable: Kryo can't serialize an abstract class.
Read the Spark docs on data serialization: http://spark.apache.org/docs/latest/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
And from the Kryo docs:
public class SomeClass implements KryoSerializable {
   // ...
   public void write (Kryo kryo, Output output) {
      // ...
   }
   public void read (Kryo kryo, Input input) {
      // ...
   }
}
But the class OntologyHumanName is generated automatically by protobuf, so I don't think implementing KryoSerializable on it is a good option.
Try wrapping OntologyHumanName in a case class so that it is not serialized directly. I haven't tried this, and it possibly doesn't work:

case class OntologyHumanNameScalaCaseClass(humanNames: OntologyHumanName)
An ugly way: convert the protobuf class to plain Scala types before the shuffle. This can't fail. For example:

import scala.collection.JavaConverters._

val humanNameObj: OntologyHumanName = ...
val families: List[String] = humanNameObj.getFamilyList.asScala.toList // use this instead of humanNameObj
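The conversion step can be sketched self-contained: a plain java.util.List stands in for the java.util.List&lt;String&gt; that the generated getFamilyList would return, and the sample values are assumptions for illustration only.

```scala
import scala.collection.JavaConverters._
import java.util.Arrays

// Stand-in for the java.util.List[String] a generated getFamilyList returns.
val javaFamilies: java.util.List[String] = Arrays.asList("Garcia", "Lopez")

// asScala wraps the Java list; toList copies it into an immutable Scala List,
// so Kryo never touches the protobuf lazy-string classes during the shuffle.
val families: List[String] = javaFamilies.asScala.toList
```

Because the result is an ordinary immutable List[String], Kryo's built-in collection serializer handles it without any registration.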
Hope this resolves your problem.