
Spark: Dataset Serialization

If I have a Dataset whose records are instances of a case class, and I persist that Dataset as shown below so that serialization is used:

myDS.persist(StorageLevel.MEMORY_ONLY_SER)

Does Spark use Java/Kryo serialization to serialize the Dataset? Or, as with a DataFrame, does Spark have its own way of storing the data in the Dataset?

Spark Dataset does not use standard serializers. Instead, it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into internal binary storage.

The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case, Spark destructures the object representation and tries to apply standard encoders (atomic encoders, Product encoders, etc.). The only difference compared to Row is its Encoder - the RowEncoder (in a sense, Encoders are similar to lenses).
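As a minimal sketch of the difference (the Person case class and the local SparkSession are assumptions for illustration), the default Product encoder exposes the fields to Spark, while an explicit Encoders.kryo encoder stores each object as an opaque blob:

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()
import spark.implicits._

// Default path: the implicit Product encoder destructures Person into Spark's
// internal binary row format, so the individual fields are visible to Catalyst.
val ds: Dataset[Person] = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
ds.printSchema()        // name: string, age: int

// Explicit opt-in: Encoders.kryo serializes each object into a single binary column.
val kryoDs: Dataset[Person] =
  spark.createDataset(Seq(Person("Alice", 30)))(Encoders.kryo[Person])
kryoDs.printSchema()    // value: binary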

Databricks explicitly contrasts Encoder/Dataset serialization with the Java and Kryo serializers in its Introducing Apache Spark Datasets post (see especially the "Lightning-fast Serialization with Encoders" section).

[Images from the linked Databricks post: serialization performance comparisons with Encoders]

Dataset[SomeCaseClass] is no different from Dataset[Row] or any other Dataset. It uses the same internal representation (mapped to instances of the external class when needed) and the same serialization method.

Therefore, there is no need for direct object serialization (Java, Kryo).
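A small illustration (assuming a SparkSession named spark and a hypothetical SomeCaseClass) is that converting between the typed and untyped views only swaps the Encoder; the schema and the underlying data stay the same:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class SomeCaseClass(id: Long, label: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val typed: Dataset[SomeCaseClass] = Seq(SomeCaseClass(1L, "a"), SomeCaseClass(2L, "b")).toDS()
val untyped: DataFrame = typed.toDF()                        // Dataset[Row] view of the same data
val roundTrip: Dataset[SomeCaseClass] = untyped.as[SomeCaseClass]

typed.printSchema()      // both views report the same columns and types
untyped.printSchema()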

Under the hood, a Dataset is an RDD. From the documentation on RDD persistence:

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

By default, Java serialization is used (source):

By default, Spark serializes objects using Java's ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.

To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

You may need to register classes with Kryo before creating the SparkContext:

conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
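Putting the pieces together, here is a minimal sketch (the Record case class and the local master are assumptions for illustration) of persisting an RDD with MEMORY_ONLY_SER after enabling and registering Kryo:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical record type standing in for your case class.
case class Record(id: Long, payload: String)

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("kryo-persist")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Record]))

val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))
rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // each partition is stored as a Kryo-serialized byte array
rdd.count()                                 // materializes the cached, serialized partitions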
