
Convert DataFrame to RDD[Map] in Scala

I want to convert a DataFrame created like this:

case class Student(name: String, age: Int)
val dataFrame: DataFrame = sql.createDataFrame(
  sql.sparkContext.parallelize(List(Student("Torcuato", 27), Student("Rosalinda", 34))))

When I collect the results from the DataFrame, the resulting array is an Array[org.apache.spark.sql.Row] = Array([Torcuato,27], [Rosalinda,34]).

I'm looking into converting the DataFrame into an RDD[Map], e.g.:

Map("name" -> nameOFFirst, "age" -> ageOfFirst)
Map("name" -> nameOFsecond, "age" -> ageOfsecond)

I tried to use map via x._1, but that does not seem to work for Array[spark.sql.Row]. How can I perform this transformation?

You can use the map function with pattern matching to do the job here:

import org.apache.spark.sql.Row

dataFrame.rdd
  .map { case Row(name, age) => Map("name" -> name, "age" -> age) }

This will result in an RDD[Map[String, Any]].
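As a minimal plain-Scala sketch of the same pattern-matching step (illustrative data, no Spark required), each (name, age) pair is destructured and rebuilt as a Map, just as each Row is in the Spark version:

```scala
// Plain-Scala sketch of the pattern-matching step (illustrative data,
// no Spark needed): each (name, age) pair becomes a Map, just as each
// Row does in the Spark answer above.
val rows: List[(String, Int)] = List(("Torcuato", 27), ("Rosalinda", 34))

val maps: List[Map[String, Any]] = rows.map {
  case (name, age) => Map("name" -> name, "age" -> age)
}

maps.foreach(println)
```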

In other words, you can transform each row of the DataFrame into a map, and the following works:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def dfToMapOfRdd(df: DataFrame): RDD[Map[String, Any]] = {
  df.rdd.map { row =>
    // getValuesMap pairs each schema field name with the row's value
    row.getValuesMap[Any](row.schema.fieldNames)
  }
}
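To see what getValuesMap produces, here is a plain-Scala approximation (illustrative only, no Spark needed): it pairs each field name from the schema with the corresponding value in the row and builds a Map from the pairs.

```scala
// Plain-Scala approximation of what row.getValuesMap does (illustrative,
// no Spark needed): zip the schema's field names with the row's values,
// then convert the resulting pairs into a Map.
val fieldNames: Array[String] = Array("name", "age")
val rowValues: Array[Any]     = Array("Torcuato", 27)

val asMap: Map[String, Any] = fieldNames.zip(rowValues).toMap
println(asMap)
```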

Refs: https://stackoverflow.com/a/46156025/6494418
