Spark: How can a DataFrame be a Dataset[Row] if a DataFrame has a schema

Question

This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.

Take the example in the blog post of converting an RDD to a DataFrame: if DataFrame were the same thing as Dataset[Row], then converting an RDD to a DataFrame should be as simple as:

val rddToDF = rdd.map(value => Row(value))

But instead the post shows that it takes this:

val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]

Clearly a DataFrame is actually a Dataset of Rows plus a schema.

Answer 1

In Spark 2.0, the source code contains: type DataFrame = Dataset[Row]

So a DataFrame is a Dataset[Row] simply by definition.

A Dataset also has a schema; you can print it with the printSchema() function. Normally Spark infers the schema, so you don't have to write it yourself, but it is still there ;)

You can also call createTempView(name) on a Dataset and use it in SQL queries, just like with DataFrames.
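To make the two points above concrete, here is a minimal sketch, assuming Spark 2.x or later and a local session created only for illustration. The names `spark`, `ds`, and the view name `letters` are made up for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session named `spark`, created here only for illustration
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dataset-schema-sketch")
  .getOrCreate()
import spark.implicits._ // brings toDS and the String encoder into scope

// A typed Dataset[String]; no schema is written by hand
val ds: Dataset[String] = Seq("a", "b", "c").toDS()

// The inferred schema is still there and can be printed
ds.printSchema()

// A typed Dataset can be registered as a view and queried with SQL
ds.createTempView("letters") // hypothetical view name
val filtered = spark.sql("SELECT value FROM letters WHERE value <> 'a'")
filtered.show()
```

For a Dataset of a primitive type such as String, Spark names the single inferred column `value`, which is why the SQL query can select it by that name.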

In other words, Dataset = DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], that is, a Dataset without a specified encoder.
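The alias and the role of the encoder can be sketched as follows. This is an illustration under stated assumptions, not code from the blog post: the case class `Person`, the sample data, and the session setup are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical case class used only for this illustration
case class Person(name: String, age: Int)

// Assumption: a local session named `spark`, created here only for illustration
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("encoder-sketch")
  .getOrCreate()
import spark.implicits._ // provides toDF and the implicit encoders

// An untyped DataFrame: each element is a generic Row
val df: DataFrame = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

// DataFrame is only a type alias, so this assignment compiles with no conversion
val rows: Dataset[Row] = df

// as[Person] picks up the implicit Encoder[Person] and maps rows onto the case class
val people: Dataset[Person] = df.as[Person]

// Fields are now accessed as typed members instead of by column name
val names: Array[String] = people.map(_.name).collect()
```

Note that `df`, `rows`, and `people` all report the same schema; only the element type seen by the Scala compiler changes.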

About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:

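import sparkSession.implicits._ // brings toDF and the String encoder into scope

// Dataset[Row] = DataFrame, without encoder.
// createDataFrame(rdd) only accepts an RDD of Product types (case classes, tuples),
// so for an RDD[String] use toDF instead
val rddToDF = rdd.toDF("value")
// Now tell Spark that the encoder for String should be used, so it becomes Dataset[String]
val rDDToDataSet = rddToDF.as[String]

// however, it can be shortened to:
val dataset = sparkSession.createDataset(rdd)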

Answer 2

Note (in addition to the answer of T Gaweda) that there is a schema associated with each Row (Row.schema). However, this schema is not set until the Row is integrated into a DataFrame (or Dataset[Row]):

scala> Row(1).schema
res12: org.apache.spark.sql.types.StructType = null

scala> val rdd = sc.parallelize(List(Row(1)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[5] at parallelize at <console>:28
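
// Assumes schema was defined earlier in the session, e.g. StructType(Seq(StructField("a", IntegerType)))
scala> spark.createDataFrame(rdd,schema).first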
res15: org.apache.spark.sql.Row = [1]
scala> spark.createDataFrame(rdd,schema).first.schema
res16: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))


Answer 1
9, accepted, 2016-10-07 10:55:20

Answer 2
2, 2016-10-07 18:30:28
