[英]How to match Dataframe column names to Scala case class attributes?
The column names in this example from spark-sql come from the case class Person
. 来自spark-sql的此示例中的列名来自
case class Person
。
case class Person(name: String, age: Int)
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html https://spark.apache.org/docs/1.1.0/sql-programming-guide.html
However in many cases the parameter names may be changed. 但是,在许多情况下,参数名称可能会更改。 This would cause columns to not be found if the file has not been updated to reflect the change.
如果文件尚未更新以反映更改,则会导致找不到列。
How can I specify an appropriate mapping? 如何指定适当的映射?
I am thinking something like: 我想的是:
val schema = StructType(Seq(
StructField("name", StringType, nullable = false),
StructField("age", IntegerType, nullable = false)
))
val ps: Seq[Person] = ???
val personRDD = sc.parallelize(ps)
// Apply the schema to the RDD.
val personDF: DataFrame = sqlContext.createDataFrame(personRDD, schema)
Basically, all the mapping you need to do can be achieved with DataFrame.select(...)
. 基本上,您需要做的所有映射都可以通过
DataFrame.select(...)
来实现。 (Here, I assume, that no type conversions need to be done.) Given the forward- and backward-mapping as maps, the essential part is (这里,我假设,不需要进行任何类型的转换。)给定前向和后向映射作为映射,基本部分是
val mapping = from.map{ (x:(String, String)) => personsDF(x._1).as(x._2) }.toArray
// personsDF your original dataframe
val mappedDF = personsDF.select( mapping: _* )
where mapping is an array of Column
s with alias. 其中mapping是带有别名的
Column
s数组。
object Example {
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
case class Person(name: String, age: Int)
object Mapping {
val from = Map("name" -> "a", "age" -> "b")
val to = Map("a" -> "name", "b" -> "age")
}
def main(args: Array[String]) : Unit = {
// init
val conf = new SparkConf()
.setAppName( "Example." )
.setMaster( "local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// create persons
val persons = Seq(Person("bob", 35), Person("alice", 27))
val personsRDD = sc.parallelize(persons, 4)
val personsDF = personsRDD.toDF
writeParquet( personsDF, "persons.parquet", sc, sqlContext)
val otherPersonDF = readParquet( "persons.parquet", sc, sqlContext )
}
def writeParquet(personsDF: DataFrame, path:String, sc: SparkContext, sqlContext: SQLContext) : Unit = {
import Mapping.from
val mapping = from.map{ (x:(String, String)) => personsDF(x._1).as(x._2) }.toArray
val mappedDF = personsDF.select( mapping: _* )
mappedDF.write.parquet("/output/path.parquet") // parquet with columns "a" and "b"
}
def readParquet(path: String, sc: SparkContext, sqlContext: SQLContext) : Unit = {
import Mapping.to
val df = sqlContext.read.parquet(path) // this df has columns a and b
val mapping = to.map{ (x:(String, String)) => df(x._1).as(x._2) }.toArray
df.select( mapping: _* )
}
}
If you need to convert a dataframe back to an RDD[Person], then 如果需要将数据帧转换回RDD [Person],那么
val rdd : RDD[Row] = personsDF.rdd
val personsRDD : Rdd[Person] = rdd.map { r: Row =>
Person( r.getAs("person"), r.getAs("age") )
}
Have also a look at How to convert spark SchemaRDD into RDD of my case class? 还要看看如何将spark SchemaRDD转换为我的case类的RDD?
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.