
Disadvantages of Spark Dataset over DataFrame

I know the advantages of Dataset (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.

Are there any specific scenarios where Spark Dataset is not recommended and it is better to use DataFrame?

Currently all our data engineering flows use Spark (Scala) DataFrame. We would like to make use of Dataset for all our new flows, so knowing the limitations/disadvantages of Dataset would help us.

EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on Dataframe/Dataset, or to other questions, most of which explain the differences between RDD, DataFrame and Dataset and how they evolved. This question is targeted at knowing when NOT to use Datasets.

There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed Dataset.

For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
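As a sketch of that pattern (assuming a SparkSession `spark` is in scope; the file name `events.json` and the field names are made up for illustration):

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

// Infer whatever schema this particular file happens to have;
// no case class needed for each record type.
val df = spark.read.json("events.json")

// Pull out only the fields this job cares about; unrelated fields
// from other record types are simply ignored.
val userEvents = df.select($"userId", $"eventType")

// The field list can even come from runtime configuration:
val wanted = Seq("userId", "eventType").map(col)
val projected = df.select(wanted: _*)
```

With a typed Dataset you would have to commit to a case class up front, which is exactly what you can't do when the schema varies.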

Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function ( df.withColumn("rootX", sqrt("X")) ) in Spark SQL, but doing it in a lambda ( ds.map(X => Math.sqrt(X)) ) would be less efficient, since Spark can't optimize your lambda function as effectively.
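A minimal sketch of the two styles side by side, assuming a Dataframe `df` with a numeric column `X` and a typed Dataset `ds` of a hypothetical case class with a field `x`:

```scala
import org.apache.spark.sql.functions.sqrt
import spark.implicits._

// Built-in column function: Catalyst sees the expression and can
// optimize it and generate code for it.
val viaBuiltin = df.withColumn("rootX", sqrt($"X"))

// Lambda: opaque to the optimizer. Each row is deserialized into a
// JVM object just to call Math.sqrt, then serialized back.
val viaLambda = ds.map(r => Math.sqrt(r.x))
```

Both produce the same values; the difference is only in how much of the work the optimizer can see.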

There are also many untyped Dataframe functions (like statistical functions) that are implemented for Dataframes but not for typed Datasets. You'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe, because these functions work by creating new columns and modifying the schema of your dataset.
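For instance (a sketch with a made-up `Sale` case class; `spark.implicits._` is assumed to be in scope):

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Sale(store: String, amount: Double)

val sales: Dataset[Sale] = Seq(
  Sale("a", 10.0), Sale("a", 2.5), Sale("b", 5.0)
).toDS()

// Both results are plain DataFrames (Dataset[Row]): the typed
// Sale schema is lost after these untyped operations.
val stats  = sales.describe("amount")
val totals = sales.groupBy($"store").sum("amount")
```

You can call .as[T] to get back to a typed Dataset, but only after defining yet another case class matching the new schema.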

In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above, not all Dataframe features have Dataset equivalents.

Limitations of Spark Datasets:

  1. Datasets used to be less performant (not sure if that's been fixed yet).
  2. You need to define a new case class whenever you change the Dataset schema, which is cumbersome.
  3. Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

// reverse() silently treats the Date column as a string -- no compile-time error:
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+

