
Spark: Dataset Serialization

If I have a dataset where each record is a case class, and I persist that dataset as shown below so that serialization is used:

myDS.persist(StorageLevel.MEMORY_ONLY_SER)

Does Spark use Java/Kryo serialization to serialize the dataset? Or, just like a dataframe, does Spark have its own way of storing the data in the dataset?

Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into internal binary storage.
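For illustration, here is a minimal sketch (using a hypothetical Person case class) showing that the implicit Product encoder destructures a case class into ordinary columns rather than storing opaque serialized objects:

import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for this example
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The implicit Product encoder maps the case class fields to columns
val ds = Seq(Person("Ann", 32), Person("Bob", 25)).toDS()
ds.schema.printTreeString()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)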

The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case Spark will destructure the object representation and try to apply standard encoders (atomic encoders, the Product encoder, etc.). The only difference compared to Row is its Encoder, RowEncoder (in a sense, Encoders are similar to lenses).
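By way of contrast, a sketch of explicitly opting in to Kryo (again with a hypothetical Payload class): the whole object is then stored as a single opaque binary column instead of being destructured:

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical class used only for this example
case class Payload(data: Map[String, Int])

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Passing Encoders.kryo explicitly bypasses the standard Product encoder
val kryoDs = spark.createDataset(Seq(Payload(Map("a" -> 1))))(Encoders.kryo[Payload])
kryoDs.schema.printTreeString()
// root
//  |-- value: binary (nullable = true)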

Databricks explicitly contrasts Encoder / Dataset serialization with the Java and Kryo serializers in its Introducing Apache Spark Datasets post (see especially the Lightning-fast Serialization with Encoders section):

[Images: performance comparison charts contrasting Encoders with Java and Kryo serialization; source: the Databricks post linked above]

Dataset[SomeCaseClass] is no different from Dataset[Row] or any other Dataset. It uses the same internal representation (mapped to instances of an external class when needed) and the same serialization method.
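A quick sketch of this point (the fields of SomeCaseClass are hypothetical): switching between the typed and untyped views only swaps the Encoder, not the underlying representation:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class SomeCaseClass(id: Long, label: String) // hypothetical fields

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val typed: Dataset[SomeCaseClass] = Seq(SomeCaseClass(1L, "a")).toDS()
val untyped: DataFrame = typed.toDF()   // Dataset[Row], same internal rows
val back = untyped.as[SomeCaseClass]    // only the Encoder changes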

Therefore, there is no need for direct object serialization (Java, Kryo).

Under the hood, a dataset is an RDD. From the documentation for RDD persistence:

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

By default, Java serialization is used (source):

By default, Spark serializes objects using Java's ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.

To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

You may need to register classes with Kryo before creating the SparkContext:

conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
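Putting both steps together, a minimal runnable sketch (the MyRecord case class and the app name are placeholders standing in for your own): registration is optional, but for unregistered classes Kryo writes the fully-qualified class name alongside every serialized object, which costs space.

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, value: String) // hypothetical stand-in for your case class

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registerKryoClasses returns the SparkConf, so the calls chain
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)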
