
How to match Dataframe column names to Scala case class attributes?

The column names in this example from spark-sql come from the case class Person.

case class Person(name: String, age: Int)

val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")

https://spark.apache.org/docs/1.1.0/sql-programming-guide.html https://spark.apache.org/docs/1.1.0/sql-programming-guide.html

However, in many cases the parameter names may change. This would cause columns not to be found if the file has not been updated to reflect the change.
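For example, suppose the field name is later renamed to fullName (a hypothetical illustration); reading the old file then fails because the stored column is still called name:

case class Person(fullName: String, age: Int) // field renamed; the file still has column "name"

val people = sqlContext.read.parquet("people.parquet")
people.select("fullName") // fails: the column cannot be resolved against the stored schema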

How can I specify an appropriate mapping?

I am thinking something like:

  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
  import org.apache.spark.sql.{DataFrame, Row}

  val schema = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)
  ))

  val ps: Seq[Person] = ???

  // createDataFrame with an explicit schema expects an RDD[Row]
  val personRowRDD = sc.parallelize(ps).map(p => Row(p.name, p.age))

  // Apply the schema to the RDD.
  val personDF: DataFrame = sqlContext.createDataFrame(personRowRDD, schema)

Basically, all the mapping you need to do can be achieved with DataFrame.select(...). (Here I assume that no type conversions need to be done.) Given the forward- and backward-mapping as maps, the essential part is

val mapping = from.map { case (oldName, newName) => personsDF(oldName).as(newName) }.toArray
// personsDF is your original DataFrame
val mappedDF = personsDF.select( mapping: _* )

where mapping is an array of Columns with aliases.
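An equivalent way to apply the renaming, for what it's worth, is to fold withColumnRenamed over the map; a minimal sketch, assuming the same from map as in the example below:

// Equivalent sketch: rename columns one by one instead of building an alias array.
val mappedDF = from.foldLeft(personsDF) { case (df, (oldName, newName)) =>
  df.withColumnRenamed(oldName, newName)
}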

Example code

object Example {   

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{DataFrame, SQLContext}
  import org.apache.spark.{SparkContext, SparkConf}

  case class Person(name: String, age: Int)

  object Mapping {
    val from = Map("name" -> "a", "age" -> "b")
    val to = Map("a" -> "name", "b" -> "age")
  }

  def main(args: Array[String]) : Unit = {
    // init
    val conf = new SparkConf()
      .setAppName( "Example." )
      .setMaster( "local[*]")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // create persons
    val persons = Seq(Person("bob", 35), Person("alice", 27))
    val personsRDD = sc.parallelize(persons, 4)
    val personsDF = personsRDD.toDF

    writeParquet( personsDF, "persons.parquet", sc, sqlContext)

    val otherPersonDF = readParquet( "persons.parquet", sc, sqlContext )
  }

  def writeParquet(personsDF: DataFrame, path:String, sc: SparkContext, sqlContext: SQLContext) : Unit = {
    import Mapping.from

    val mapping = from.map { case (oldName, newName) => personsDF(oldName).as(newName) }.toArray

    val mappedDF = personsDF.select( mapping: _* )
    mappedDF.write.parquet(path) // parquet file with columns "a" and "b"
  }

  def readParquet(path: String, sc: SparkContext, sqlContext: SQLContext) : DataFrame = {
    import Mapping.to
    val df = sqlContext.read.parquet(path) // this df has columns a and b

    val mapping = to.map { case (oldName, newName) => df(oldName).as(newName) }.toArray
    df.select( mapping: _* ) // columns renamed back to "name" and "age"
  }
}

Remark

If you need to convert a DataFrame back to an RDD[Person], then

val rdd: RDD[Row] = personsDF.rdd
val personsRDD: RDD[Person] = rdd.map { r: Row =>
  Person(r.getAs[String]("name"), r.getAs[Int]("age"))
}
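Putting the pieces together, a round trip from the renamed parquet file back to RDD[Person] could look like this (a sketch reusing the Mapping.to map from the example above):

val raw = sqlContext.read.parquet("persons.parquet") // columns "a" and "b"
val backMapping = Mapping.to.map { case (oldName, newName) => raw(oldName).as(newName) }.toArray
val renamed = raw.select( backMapping: _* ) // columns "name" and "age"
val persons: RDD[Person] = renamed.rdd.map { r =>
  Person(r.getAs[String]("name"), r.getAs[Int]("age"))
}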

Alternatives

Also have a look at How to convert spark SchemaRDD into RDD of my case class?
