
Spark: Dataset Serialization

If I have a Dataset whose records are instances of a case class, and I persist that Dataset as shown below so that serialization is used:

myDS.persist(StorageLevel.MEMORY_ONLY_SER)

Does Spark use Java/Kryo serialization to serialize the Dataset? Or, as with a DataFrame, does Spark have its own way of storing the data in the Dataset?

Spark Dataset does not use standard serializers. Instead, it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into internal binary storage.

The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case, Spark destructures the object representation and tries to apply standard encoders (atomic encoders, Product encoders, etc.). The only difference compared to Row is its Encoder - the RowEncoder (in a sense, Encoders are similar to lenses).
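As a minimal sketch of the difference (the Person case class and the local SparkSession are assumptions for illustration), the default Product encoder exposes the fields to Spark, while an explicit Encoders.kryo encoder stores each object as an opaque blob:

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()
import spark.implicits._

// Default path: the implicit Product encoder destructures Person into Spark's
// internal binary row format, so the individual fields are visible to Catalyst.
val ds: Dataset[Person] = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()
ds.printSchema()        // name: string, age: int

// Explicit opt-in: Encoders.kryo serializes each object into a single binary column.
val kryoDs: Dataset[Person] =
  spark.createDataset(Seq(Person("Alice", 30)))(Encoders.kryo[Person])
kryoDs.printSchema()    // value: binary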

Databricks explicitly contrasts Encoder/Dataset serialization with the Java and Kryo serializers in its Introducing Apache Spark Datasets post (see especially the "Lightning-fast Serialization with Encoders" section).

[Images from the linked Databricks post: serialization performance comparisons with Encoders]

Dataset[SomeCaseClass] is no different from Dataset[Row] or any other Dataset. It uses the same internal representation (mapped to instances of the external class when needed) and the same serialization method.

Therefore, there is no need for direct object serialization (Java, Kryo).
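A small illustration (assuming a SparkSession named spark and a hypothetical SomeCaseClass) is that converting between the typed and untyped views only swaps the Encoder; the schema and the underlying data stay the same:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class SomeCaseClass(id: Long, label: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val typed: Dataset[SomeCaseClass] = Seq(SomeCaseClass(1L, "a"), SomeCaseClass(2L, "b")).toDS()
val untyped: DataFrame = typed.toDF()                        // Dataset[Row] view of the same data
val roundTrip: Dataset[SomeCaseClass] = untyped.as[SomeCaseClass]

typed.printSchema()      // both views report the same columns and types
untyped.printSchema()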

Under the hood, a Dataset is an RDD. From the documentation on RDD persistence:

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

By default, Java serialization is used (source):

By default, Spark serializes objects using Java's ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.

To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

You may need to register classes with Kryo before creating the SparkContext:

conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
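Putting the pieces together, here is a minimal sketch (the Record case class and the local master are assumptions for illustration) of persisting an RDD with MEMORY_ONLY_SER after enabling and registering Kryo:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical record type standing in for your case class.
case class Record(id: Long, payload: String)

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("kryo-persist")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Record]))

val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))
rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // each partition is stored as a Kryo-serialized byte array
rdd.count()                                 // materializes the cached, serialized partitions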
