spark.createDataFrame() not working with Seq RDD
createDataFrame takes two arguments, an RDD and a schema. My schema is like this:
val schemas = StructType(Seq(
  StructField("number", IntegerType, false),
  StructField("notation", StringType, false)
))
In one case I am able to create a dataframe from an RDD, like below:
val data1 = Seq(Row(1, "one"), Row(2, "two"))
val rdd = spark.sparkContext.parallelize(data1)
val final_df = spark.createDataFrame(rdd, schemas)
In the other case, like below, I am not able to:
val data2 = Seq((1, "one"), (2, "two"))
val rdd = spark.sparkContext.parallelize(data2)
val final_df = spark.createDataFrame(rdd, schemas)
What's wrong with data2 that prevents it from becoming a valid RDD for a DataFrame?
However, we are able to create a dataframe from data2 using toDF(), just not with createDataFrame:

val data2_DF = Seq((1, "one"), (2, "two")).toDF("number", "notation")
Please help me understand this behaviour. Is Row mandatory while creating a dataframe?
In the second case, just do:
val final_df = spark.createDataFrame(rdd)
Because your RDD is an RDD of Tuple2 (which is a Product), the schema is known at compile time, so you don't need to specify a schema.
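For completeness, here is a minimal sketch (assuming a `SparkSession` named `spark` is in scope, as in the question) showing both routes from the tuple data: letting Spark infer the schema from the Product type, or mapping each tuple to a Row so it matches the createDataFrame(RDD[Row], StructType) overload used in the first case:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val data2 = Seq((1, "one"), (2, "two"))
val rdd = spark.sparkContext.parallelize(data2)

// Option 1: RDD[(Int, String)] is an RDD of a Product, so the schema
// is inferred at compile time; no StructType needed.
val inferredDf = spark.createDataFrame(rdd).toDF("number", "notation")

// Option 2: keep the explicit schema by converting each tuple to a Row,
// which satisfies the createDataFrame(RDD[Row], StructType) overload.
val schemas = StructType(Seq(
  StructField("number", IntegerType, false),
  StructField("notation", StringType, false)
))
val rowRdd = rdd.map { case (n, s) => Row(n, s) }
val explicitDf = spark.createDataFrame(rowRdd, schemas)
```

So Row is only mandatory for the overload that takes an explicit schema; the tuple-based overload infers column types on its own.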