
Encoder difference between Spark DataFrame and Dataset

When reading about the differences between Spark's DataFrame (which is an alias for Dataset[Row]) and Dataset, it's often mentioned that Datasets make use of Encoders to efficiently convert between JVM objects and Spark's internal data representation. In Scala, implicit encoders are provided for case classes and primitive types. However, there is also the RowEncoder which, I believe, handles encoding for the Row objects in DataFrames.
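For concreteness, here is a small spark-shell style sketch of the two kinds of encoders referred to above (names and values are made up; RowEncoder is an internal Catalyst API whose exact location and signature have shifted across Spark versions):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()
import spark.implicits._

// A case class picks up an implicit Encoder from spark.implicits._,
// which backs a typed Dataset[Person].
case class Person(name: String, age: Int)
val ds = Seq(Person("Ada", 36), Person("Linus", 54)).toDS()

// A DataFrame is Dataset[Row]; its rows are described only by a schema,
// and RowEncoder derives an encoder for Row from that schema.
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType, nullable = false)))
val rowEnc = RowEncoder(schema)

// The same data viewed untyped: identical internal layout, different API.
val df = ds.toDF()
df.printSchema()
```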

My questions are:

  1. In terms of efficient conversion between JVM objects and Spark's internal binary representation, do DataFrames and Datasets perform the same?
  2. What additional benefits does a specific type (like a case class in Scala) provide over the generic Row as far as encoding (serializing/deserializing) goes? Apart from compile-time type safety, do typed JVM objects provide any advantage over the semi-typed (or "untyped") Row?

DataFrames are just Datasets with an encoder for Spark's Row class, so in essence a DataFrame is a Dataset.
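That equivalence is literal in the Spark source, where DataFrame is nothing more than a type alias for Dataset[Row], as this small sketch shows:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// In the org.apache.spark.sql package object, DataFrame is defined as:
//   type DataFrame = Dataset[Row]
// so the two types are interchangeable at compile time.
def describe(df: DataFrame): Unit = {
  val asDataset: Dataset[Row] = df   // no conversion needed
  asDataset.printSchema()
}
```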

Encoders also do not come into play at all unless you use non-column functions (functions that take a lambda, such as map, reduce, and flatMap). The moment you use one of those functions there is a performance hit, because you split the code generation Catalyst performs into two parts: it cannot optimize the lambda itself. This means you probably don't want to use those functions at all, and you can ignore the Dataset/DataFrame difference entirely, since if you never use them you never encode.
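A rough sketch of that difference (the exact plan output depends on your Spark version, but the typed map should introduce object serialization nodes that the column version does not):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("lambda-vs-column").getOrCreate()
import spark.implicits._

val ids = spark.range(1000)

// Column-based: the expression stays inside Catalyst and whole-stage codegen.
val doubledCols = ids.withColumn("doubled", col("id") * 2)

// Lambda-based: the typed map has to deserialize each row to a JVM object,
// run the opaque lambda, and serialize the result back, splitting codegen.
val doubledTyped = ids.map(_ * 2)

doubledCols.explain()   // pure column expressions in the plan
doubledTyped.explain()  // DeserializeToObject / MapElements / SerializeFromObject
```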

In my experience, the type safety you get from a Dataset and its typed APIs is not worth the large performance hit. In almost all cases I've found you should stay with DataFrames and use only column-based functions and UDFs for the best performance.
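For illustration, a minimal sketch of staying in the DataFrame world with built-in column functions and a UDF (column names and data are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, upper}

val spark = SparkSession.builder().master("local[*]").appName("columns-and-udfs").getOrCreate()
import spark.implicits._

val df = Seq(("ada", 36), ("linus", 54)).toDF("name", "age")

// Built-in column functions: fully visible to Catalyst, best performance.
val upperCased = df.withColumn("name_upper", upper(col("name")))

// A UDF keeps you in the DataFrame API (no typed encoder round trip), but its
// body is still opaque to the optimizer, so prefer built-ins when one exists.
val firstLetter = udf((name: String) => name.take(1))
val withInitial = upperCased.withColumn("initial", firstLetter(col("name")))

withInitial.show()
```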

As an additional note, the only other time an encoder is used is when you parallelize a collection. All data sources provide Rows or internal rows to Spark, so your encoder will not be used for most sources.
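To illustrate, a sketch of the one case where the encoder is exercised (parallelizing a local collection) versus reading from a source (the parquet path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("encoder-usage").getOrCreate()
import spark.implicits._

case class Event(id: Long, label: String)

// Parallelizing a local collection: the implicit Encoder[Event] is used here
// to turn JVM objects into Spark's internal row format.
val local = spark.createDataset(Seq(Event(1L, "start"), Event(2L, "stop")))

// Reading from a data source: the source hands Spark internal rows directly,
// so no encoder work happens until a typed lambda operation is applied.
// (The path below is hypothetical.)
val fromSource = spark.read.parquet("/tmp/events.parquet").as[Event]
```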
