
Disadvantages of Spark Dataset over DataFrame

I know the advantages of Dataset (type safety, etc.), but I can't find any documentation on the limitations of Spark Datasets.

Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?

Currently all our data engineering flows use Spark (Scala) DataFrames. We would like to use Datasets for all our new flows, so knowing all the limitations and disadvantages of Datasets would help us.

EDIT: This is not a duplicate of "Spark 2.0 Dataset vs DataFrame", which explains some operations on DataFrames and Datasets, or of the other questions, most of which explain the differences between RDD, DataFrame, and Dataset and how they evolved. This question is specifically about when NOT to use Datasets.

There are a few scenarios where I find that a DataFrame (or Dataset[Row]) is more useful than a typed Dataset.

One example is consuming data without a fixed schema, such as JSON files containing records of different types with different fields. With a DataFrame I can easily "select" the fields I need without knowing the whole schema, or even use a runtime configuration to specify which fields I'll access.
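As a rough illustration of that point, here is a minimal sketch of a config-driven select on mixed-shape JSON. The sample records, the `wantedFields` list, and the local `SparkSession` setup are all made up for the example; in a real job the field names would come from your own configuration. It is written to be pasted into spark-shell or a Scala worksheet.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for demonstration purposes only
val spark = SparkSession.builder().master("local[*]").appName("untyped-select").getOrCreate()
import spark.implicits._

// Hypothetical records of two different shapes in the same input
val raw = Seq(
  """{"type":"click","userId":1,"url":"/home"}""",
  """{"type":"purchase","userId":2,"amount":9.99}"""
).toDS()

// Spark infers a merged schema; fields missing from a record become null
val events: DataFrame = spark.read.json(raw)

// Hypothetical runtime configuration: these names could come from a config file
val wantedFields: Seq[String] = Seq("type", "userId")

// Project only the configured fields, without declaring a case class for the full schema
val projected: DataFrame = events.select(wantedFields.map(col): _*)
projected.show()
```

Doing the same with a typed Dataset would require a case class covering every record shape up front, which is exactly what is hard to provide when the schema is not fixed.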

Another consideration is that Spark can optimize the built-in Spark SQL operations and aggregations better than UDAFs and custom lambdas. So if you want the square root of a value in a column, that's a built-in function in Spark SQL ( df.withColumn("rootX", sqrt("X")) ), but doing it in a lambda ( ds.map(X => Math.sqrt(X)) ) would be less efficient, because Spark treats the lambda as a black box and can't optimize it as effectively.
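To see the difference for yourself, a small sketch like the following compares the two physical plans. The sample data and the local `SparkSession` are assumptions for the example, and the exact plan text varies by Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sqrt

// Assumption: a local session for demonstration purposes only
val spark = SparkSession.builder().master("local[*]").appName("builtin-vs-lambda").getOrCreate()
import spark.implicits._

val df = Seq(4.0, 9.0, 16.0).toDF("X")

// Built-in column expression: the optimizer can see and reason about sqrt
val builtIn = df.withColumn("rootX", sqrt($"X"))

// Typed lambda: Spark has to deserialize each value, call the function, and serialize the result
val viaLambda = df.as[Double].map(x => math.sqrt(x))

// Print both physical plans for comparison
builtIn.explain()
viaLambda.explain()
```

In the versions I have looked at, the lambda plan usually contains extra object deserialization and serialization steps around the map, while the built-in version stays a plain projection. Treat that as something to verify on your own Spark version rather than a guarantee.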

There are also many untyped DataFrame functions (such as the statistical functions) that have no typed Dataset equivalent. You'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a DataFrame, because those functions work by creating new columns and so change the schema of your Dataset.
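A short sketch of how this tends to play out, using made-up `Employee` and `DeptAvg` case classes and a local `SparkSession` (all hypothetical names for illustration). It is intended for spark-shell or a worksheet; in compiled code the case classes would need to live at the top level.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

// Assumption: a local session for demonstration purposes only
val spark = SparkSession.builder().master("local[*]").appName("agg-returns-df").getOrCreate()
import spark.implicits._

// Hypothetical input type
case class Employee(dept: String, salary: Double)

val employees: Dataset[Employee] = Seq(
  Employee("eng", 100.0),
  Employee("eng", 120.0),
  Employee("ops", 80.0)
).toDS()

// groupBy(...).agg(...) on a typed Dataset hands back an untyped DataFrame
val avgByDept: DataFrame = employees.groupBy($"dept").agg(avg($"salary").as("avgSalary"))

// Statistical helpers such as describe also return a DataFrame
val summary: DataFrame = employees.describe("salary")

// Getting back to a typed Dataset needs a second case class matching the new schema
case class DeptAvg(dept: String, avgSalary: Double)
val typedAgain: Dataset[DeptAvg] = avgByDept.as[DeptAvg]
typedAgain.show()
```

Every aggregation that changes the shape of the data needs another case class like `DeptAvg` if you want to stay typed, which is part of why pipelines that begin with Datasets often end up as DataFrames anyway.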

In general, I don't think you should migrate working DataFrame code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and, as mentioned above, not all DataFrame features have Dataset equivalents.

Limitations of Spark Datasets:

  1. Datasets used to be less performant (not sure if that's been fixed yet)
  2. You need to define a new case class whenever you change the Dataset schema, which is cumbersome
  3. Datasets don't offer as much type safety as you might expect. For example, we can pass the reverse function a date column and it will return a garbage result rather than raising an error:
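
    // Assumes an active SparkSession named `spark`, as in spark-shell
    import java.sql.Date
    import org.apache.spark.sql.functions.reverse
    import spark.implicits._

    case class Birth(hospitalName: String, birthDate: Date)

    val birthsDS = Seq(
      Birth("westchester", Date.valueOf("2014-01-15"))
    ).toDS()

    // The date is implicitly cast to a string and reversed; no error is raised
    birthsDS.withColumn("meaningless", reverse($"birthDate")).show()

    +------------+----------+-----------+
    |hospitalName| birthDate|meaningless|
    +------------+----------+-----------+
    | westchester|2014-01-15| 51-10-4102|
    +------------+----------+-----------+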
