Spark: How can a DataFrame be a Dataset[Row] if DataFrames have a schema?
This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.
Take the example in the blog post of converting an RDD to a DataFrame: if DataFrame were the same thing as Dataset[Row], then converting an RDD to a DataFrame should be as simple as:
val rddToDF = rdd.map(value => Row(value))
But instead the post shows that it takes this:
val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]
Clearly a DataFrame is actually a dataset of rows and a schema.
In Spark 2.0, the source code contains:
type DataFrame = Dataset[Row]
So a DataFrame is a Dataset[Row] simply by definition.
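Because the alias makes the two names denote one type, a value of one can be assigned to the other without any conversion. The following is a minimal sketch, assuming Spark 2.x or later on the classpath (for example pasted into spark-shell); the local session settings and the sample values are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder().master("local[1]").appName("alias-sketch").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq("a", "b").toDF("value")

// No conversion call is needed in either direction: both names are the same type.
val ds: Dataset[Row] = df
val back: DataFrame = ds

// The compiler can also be asked to prove the type equality directly.
val sameType = implicitly[DataFrame =:= Dataset[Row]]

// The schema travels with the Dataset[Row] value.
ds.printSchema()
```

If the two were distinct types, the plain assignments above would not compile.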
A Dataset also has a schema; you can print it with the printSchema() function. Normally Spark infers the schema, so you don't have to write it yourself. It's still there, though ;)
You can also call createTempView(name) and use it in SQL queries, just like with DataFrames.
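As a rough illustration of both points, the sketch below builds a typed Dataset, prints the schema Spark inferred for it, and queries it through a temporary view. The `Person` case class, the sample rows, and the view name `people` are hypothetical; the snippet is meant for spark-shell or a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

// Assumption: a throwaway local session.
val spark = SparkSession.builder().master("local[1]").appName("schema-sketch").getOrCreate()
import spark.implicits._

// The schema is inferred from the case class fields; nothing is written by hand.
val people: Dataset[Person] = Seq(Person("Ann", 31), Person("Bob", 45)).toDS()
people.printSchema()

// A typed Dataset can be registered as a view and queried with SQL, like a DataFrame.
// createOrReplaceTempView is used instead of createTempView so the snippet can be rerun.
people.createOrReplaceTempView("people")
val over40 = spark.sql("SELECT name FROM people WHERE age > 40")
over40.show()
```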
In other words, Dataset = DataFrame from Spark 1.5 + encoder, where the encoder converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], that is, a Dataset with no class-specific encoder, whose elements stay generic Row objects.
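The role of the encoder can be sketched by moving the same data between the untyped and typed views. This is an illustration under assumptions, not code from the answer: the `Person` case class and the sample rows are hypothetical, and the snippet assumes spark-shell or a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

// Assumption: a throwaway local session.
val spark = SparkSession.builder().master("local[1]").appName("encoder-sketch").getOrCreate()
import spark.implicits._ // supplies the implicit Encoder[Person]

// Untyped view: every element is a generic Row, read by position or by name.
val df: DataFrame = Seq(("Ann", 31), ("Bob", 45)).toDF("name", "age")
val firstRow: Row = df.first()

// as[Person] attaches the Person encoder, so elements come back as Person objects.
val typed: Dataset[Person] = df.as[Person]
val firstPerson: Person = typed.first()

// toDF() returns to the generic Row view of the same data.
val untyped: DataFrame = typed.toDF()
```

The column names must match the case class field names for `as[Person]` to resolve.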
About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:
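import sparkSession.implicits._ // provides Encoder[String] and the toDF() conversion
// Dataset[Row] = DataFrame, without a class-specific encoder
// (createDataFrame(rdd) only accepts an RDD of Product types, so a plain RDD[String] goes through toDF)
val rddToDF = rdd.toDF("value")
// Now it knows that the encoder for String should be used, so it becomes a Dataset[String]
val rDDToDataSet = rddToDF.as[String]
// However, this can be shortened to:
val dataset = sparkSession.createDataset(rdd)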
Note (in addition to T Gaweda's answer) that there is a schema associated with each Row (Row.schema). However, this schema is not set until the Row is integrated into a DataFrame (or Dataset[Row]):
scala> Row(1).schema
res12: org.apache.spark.sql.types.StructType = null
scala> val rdd = sc.parallelize(List(Row(1)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[5] at parallelize at <console>:28
scala> spark.createDataFrame(rdd,schema).first
res15: org.apache.spark.sql.Row = [1]
scala> spark.createDataFrame(rdd,schema).first.schema
res16: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))
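The transcript above uses a `schema` value without showing its definition. The sketch below repeats the same steps as a self-contained snippet; the `schema` definition is an assumption reconstructed from the `res16` output, and the local session settings are illustrative. It assumes spark-shell or a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Assumption: a throwaway local session.
val spark = SparkSession.builder().master("local[1]").appName("row-schema-sketch").getOrCreate()

// Assumption: reconstructed from the res16 output, where the schema is printed but never defined.
val schema = StructType(Seq(StructField("a", IntegerType, nullable = true)))

// A free-standing Row carries no schema.
val bare = Row(1)
println(bare.schema)

// Once the rows go through a DataFrame, the rows read back carry the DataFrame's schema.
val rdd = spark.sparkContext.parallelize(Seq(Row(1)))
val df = spark.createDataFrame(rdd, schema)
val first = df.first()
println(first.schema)
```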