
spark.createDataFrame() not working with Seq RDD

createDataFrame takes two arguments: an RDD and a schema.

My schema is like this:

val schemas = StructType(Seq(
  StructField("number", IntegerType, false),
  StructField("notation", StringType, false)
))

In one case I am able to create a DataFrame from an RDD, like below:

val data1 = Seq(Row(1, "one"), Row(2, "two"))
val rdd = spark.sparkContext.parallelize(data1)
val final_df = spark.createDataFrame(rdd, schemas)

In the other case, like below, I am not able to:

val data2 = Seq((1, "one"), (2, "two"))
val rdd = spark.sparkContext.parallelize(data2)
val final_df = spark.createDataFrame(rdd, schemas)

What is wrong with data2 that prevents it from becoming a valid RDD for a DataFrame?

However, we are able to create a DataFrame from data2 using toDF(), just not with createDataFrame:

val data2_DF = Seq((1, "one"), (2, "two")).toDF("number", "notation")

Please help me understand this behaviour.

Is Row mandatory when creating a DataFrame?

In the second case, just do:

val final_df = spark.createDataFrame(rdd)

Because your RDD is an RDD of Tuple2 (which is a Product), the schema is known at compile time, so you don't need to specify a schema.
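To make that concrete, here is a minimal sketch of both routes, reusing spark, schemas, and the data2 RDD from the question (the names inferred_df, rowRdd, and final_df2 are illustrative): letting Spark infer the schema from the Tuple2 elements and renaming the columns with toDF, versus mapping each tuple to a Row so the explicit-schema overload can be used.

import org.apache.spark.sql.Row

// Option 1: the schema is inferred from the Tuple2 (Product) elements;
// the default column names _1 and _2 are then renamed with toDF.
val inferred_df = spark.createDataFrame(rdd).toDF("number", "notation")

// Option 2: keep the explicit StructType by first mapping each tuple to a Row,
// since the overload that accepts a schema expects an RDD[Row].
val rowRdd = rdd.map { case (number, notation) => Row(number, notation) }
val final_df2 = spark.createDataFrame(rowRdd, schemas)

Either route yields the same two-column DataFrame; the difference is only whether the column names (and nullability flags) come from Scala type inference or from the hand-written StructType.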
