
Spark: Dataset Serialization

If I have a dataset where each record is a case class, and I persist that dataset as shown below so that serialization is used:

myDS.persist(StorageLevel.MEMORY_ONLY_SER)

Does Spark use Java/Kryo serialization to serialize the dataset? Or, just like a dataframe, does Spark have its own way of storing the data in the dataset?

Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into internal binary storage.
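For illustration, here is a minimal sketch (using a hypothetical Person case class) showing that the implicit Product encoder destructures a case class into ordinary columns rather than storing opaque serialized objects:

import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for this example
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The implicit Product encoder maps the case class fields to columns
val ds = Seq(Person("Ann", 32), Person("Bob", 25)).toDS()
ds.schema.printTreeString()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)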

The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case Spark will destructure the object representation and try to apply standard encoders (atomic encoders, the Product encoder, etc.). The only difference compared to Row is its Encoder, RowEncoder (in a sense, Encoders are similar to lenses).
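By way of contrast, a sketch of explicitly opting in to Kryo (again with a hypothetical Payload class): the whole object is then stored as a single opaque binary column instead of being destructured:

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical class used only for this example
case class Payload(data: Map[String, Int])

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Passing Encoders.kryo explicitly bypasses the standard Product encoder
val kryoDs = spark.createDataset(Seq(Payload(Map("a" -> 1))))(Encoders.kryo[Payload])
kryoDs.schema.printTreeString()
// root
//  |-- value: binary (nullable = true)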

Databricks explicitly contrasts Encoder / Dataset serialization with the Java and Kryo serializers in its Introducing Apache Spark Datasets post (see especially the Lightning-fast Serialization with Encoders section):

[Images: performance comparison charts contrasting Encoders with Java and Kryo serialization; source: the Databricks post linked above]

Dataset[SomeCaseClass] is no different from Dataset[Row] or any other Dataset. It uses the same internal representation (mapped to instances of an external class when needed) and the same serialization method.
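A quick sketch of this point (the fields of SomeCaseClass are hypothetical): switching between the typed and untyped views only swaps the Encoder, not the underlying representation:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class SomeCaseClass(id: Long, label: String) // hypothetical fields

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val typed: Dataset[SomeCaseClass] = Seq(SomeCaseClass(1L, "a")).toDS()
val untyped: DataFrame = typed.toDF()   // Dataset[Row], same internal rows
val back = untyped.as[SomeCaseClass]    // only the Encoder changes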

Therefore, there is no need for direct object serialization (Java, Kryo).

Under the hood, a dataset is an RDD. From the documentation for RDD persistence:

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

By default, Java serialization is used (source):

By default, Spark serializes objects using Java's ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.

To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

You may need to register classes with Kryo before creating the SparkContext:

conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
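Putting both steps together, a minimal runnable sketch (the MyRecord case class and the app name are placeholders standing in for your own): registration is optional, but for unregistered classes Kryo writes the fully-qualified class name alongside every serialized object, which costs space.

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, value: String) // hypothetical stand-in for your case class

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registerKryoClasses returns the SparkConf, so the calls chain
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)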
