How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:

// sc is the SparkContext, while sqlContext is the SQLContext.

// Define the case class and raw data
case class Dog(name: String)
val data = Array(
    Dog("Rex"),
    Dog("Fido")
)

// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)

// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)

// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])

// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()

The output I'm seeing is:

Dog(Rex)
Dog(Fido)
++
||
++
||
||
++

What am I missing?

Thanks!

All you need is:

val dogDF = sqlContext.createDataFrame(dogRDD)

The second parameter is part of the Java API and expects your class to follow the Java beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
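To make the bean-convention point concrete, here is a minimal sketch that uses only the JDK's java.beans.Introspector (not Spark itself) to list the readable bean properties of a case class and of a class with generated getters. DogBean and the helper readableProps are hypothetical names introduced for illustration, and the idea that Spark's Java API inspects classes in a comparable way is an assumption consistent with the explanation above.

```scala
import java.beans.Introspector
import scala.beans.BeanProperty

// Same shape as the case class in the question: accessor is name(), not getName().
case class Dog(name: String)

// Hypothetical bean-style counterpart: @BeanProperty generates getName/setName.
class DogBean(@BeanProperty var name: String)

object BeanCheck {
  // Lists properties that have a bean-style getter, ignoring the implicit "class" property.
  def readableProps(c: Class[_]): List[String] =
    Introspector.getBeanInfo(c).getPropertyDescriptors
      .filter(_.getReadMethod != null)
      .map(_.getName)
      .filterNot(_ == "class")
      .toList

  def main(args: Array[String]): Unit = {
    // The case class should expose no bean properties; the bean class should expose "name".
    println(readableProps(classOf[Dog]))
    println(readableProps(classOf[DogBean]))
  }
}
```

With no bean properties to map to columns, a schema built this way has nothing in it, which matches the empty table in the question.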

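You can also create a DataFrame directly from a Seq of case class instances using toDF (this requires import sqlContext.implicits._ to be in scope), as follows: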

val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
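For context, a fuller sketch of the toDF route next to the one-argument createDataFrame call. It assumes a local master and the SQLContext entry point used in the question (on Spark 2 and later, SparkSession is the more usual entry point); the object name ToDfSketch and its helper methods are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Defined at top level (not inside a method) so Spark can derive the schema by reflection.
case class Dog(name: String)

object ToDfSketch {
  // Builds a DataFrame from a local Seq of case class instances via toDF.
  def fromSeq(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._ // brings toDF into scope
    Seq(Dog("Rex"), Dog("Fido")).toDF()
  }

  // Builds a DataFrame from an RDD of case class instances, without a Class argument.
  def fromRdd(sc: SparkContext, sqlContext: SQLContext): DataFrame =
    sqlContext.createDataFrame(sc.parallelize(Seq(Dog("Rex"), Dog("Fido"))))

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, just for trying this out.
    val conf = new SparkConf().setAppName("toDF-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    fromSeq(sqlContext).printSchema()
    fromRdd(sc, sqlContext).show()

    sc.stop()
  }
}
```

Both routes infer the schema from the case class fields, so each DataFrame should have a single name column.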

The case class approach won't work in cluster mode. It throws a ClassNotFoundException for the case class you defined.

Convert it to an RDD[Row], define the schema of your RDD with StructField, and then call createDataFrame, like this:

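import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Here data is assumed to be an RDD of indexable records, e.g. RDD[Array[String]]
val rdd = data.map { attrs => Row(attrs(0), attrs(1)) }

val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true), StructField("pos", StringType, nullable = true)))

sqlContext.createDataFrame(rdd, rddStruct)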

toDF() won't work either.
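An end-to-end sketch of the explicit-schema route, for anyone who wants to run it. The sample records, the object name RowSchemaSketch, the helper build, and the local master are all assumptions made for illustration; only the id and pos column names come from the answer above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object RowSchemaSketch {
  // Turns an RDD of two-element string arrays into a DataFrame with an explicit schema.
  def build(sc: SparkContext, sqlContext: SQLContext): DataFrame = {
    // Hypothetical input: each record is an array holding (id, pos).
    val data = sc.parallelize(Seq(Array("1", "left"), Array("2", "right")))

    // Wrap each record in a generic Row; no case class is involved.
    val rowRDD = data.map(attrs => Row(attrs(0), attrs(1)))

    // Describe the columns by hand instead of inferring them by reflection.
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("pos", StringType, nullable = true)))

    sqlContext.createDataFrame(rowRDD, schema)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("row-schema-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = build(sc, sqlContext)
    df.printSchema()
    df.show()

    sc.stop()
  }
}
```

The trade-off is that the schema must be kept in sync with the Row contents by hand, since nothing checks the field order or types at compile time.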
