
How to convert Spark's DataFrame to nested DataFrame

I have a DataFrame with 6 columns like this:

df.printSchema
root
 |-- d1: string (nullable = true)
 |-- d2: string (nullable = true)
 |-- d3: string (nullable = true)
 |-- m1: string (nullable = true)
 |-- m2: string (nullable = true)
 |-- m3: string (nullable = true)

For some reasons, I'd like to convert it to something like this:

root
 |-- d1: string (nullable = true)
 |-- d2: string (nullable = true)
 |-- d3: string (nullable = true)
 |-- metric: struct (nullable = true)
 |    |-- m1: string (nullable = true)
 |    |-- m2: string (nullable = true)
 |    |-- m3: string (nullable = true)

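(For reference, that target layout corresponds to the following explicit schema, a sketch assuming every field stays a string; the `target` name is just for illustration:)

```scala
import org.apache.spark.sql.types._

// The desired schema written out explicitly: three flat string columns
// plus one struct column holding the three metric fields
val target = StructType(Seq(
  StructField("d1", StringType),
  StructField("d2", StringType),
  StructField("d3", StringType),
  StructField("metric", StructType(Seq(
    StructField("m1", StringType),
    StructField("m2", StringType),
    StructField("m3", StringType))))))
```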
I spent hours but I can't figure it out. What I have done so far is below:

case class Metric(m1: String, m2: String, m3: String)
case class Dimension(d1: String, d2: String, d3: String, metric: Metric)

scala> df.map(row => Dimension(row.getAs[String]("d1"),
     |   row.getAs[String]("d2"),
     |   row.getAs[String]("d3"),
     |   Metric(row.getAs[String]("m1"),
     |       row.getAs[String]("m2"),
     |       row.getAs[String]("m3"))))
res48: org.apache.spark.rdd.RDD[Dimension] = MapPartitionsRDD[32] at map at <console>:46

scala> df.map(row => Dimension(row.getAs[String]("d1"),
     |   row.getAs[String]("d2"),
     |   row.getAs[String]("d3"),
     |   Metric(row.getAs[String]("m1"),
     |       row.getAs[String]("m2"),
     |       row.getAs[String]("m3")))).collect().foreach(println)
WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 220, hostname): java.lang.ClassNotFoundException: $line55.$read$$iwC$$iwC$Dimension

scala> df.map(row => Dimension(row.getAs[String]("d1"),
     |   row.getAs[String]("d2"),
     |   row.getAs[String]("d3"),
     |   Metric(row.getAs[String]("m1"),
     |       row.getAs[String]("m2"),
     |       row.getAs[String]("m3")))).toDF
res50: org.apache.spark.sql.DataFrame = [d1: string, d2: string, d3: string, metric: struct<m1:string,m2:string,m3:string>]

scala> df.map(row => Dimension(row.getAs[String]("d1"),
     |   row.getAs[String]("d2"),
     |   row.getAs[String]("d3"),
     |   Metric(row.getAs[String]("m1"),
     |       row.getAs[String]("m2"),
     |       row.getAs[String]("m3")))).toDF.select("d1").show()
ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionStart(1,show at <console>:51,org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
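(The `ClassNotFoundException: $line55.$read$$iwC$$iwC$Dimension` above is typically a spark-shell quirk: case classes defined in the REPL get wrapped in synthetic `$iwC` classes that executors cannot always load. A sketch of the same map-based approach in a compiled application, where the case classes are ordinary top-level classes; the object name and sample data are invented for illustration:)

```scala
import org.apache.spark.sql.SparkSession

// Top-level case classes compile to plain classes the executors can load,
// unlike classes defined interactively in the spark-shell
case class Metric(m1: String, m2: String, m3: String)
case class Dimension(d1: String, d2: String, d3: String, metric: Metric)

object NestByMap {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nest-by-map").getOrCreate()
    import spark.implicits._

    // Stand-in for the six-column DataFrame from the question
    val df = Seq(("a", "b", "c", "1", "2", "3")).toDF("d1", "d2", "d3", "m1", "m2", "m3")

    // Returns Dataset[Dimension]; the implicit encoder is resolved because
    // the case classes are compiled, not REPL-defined
    val ds = df.map(row => Dimension(
      row.getAs[String]("d1"), row.getAs[String]("d2"), row.getAs[String]("d3"),
      Metric(row.getAs[String]("m1"), row.getAs[String]("m2"), row.getAs[String]("m3"))))

    ds.toDF().printSchema()
    spark.stop()
  }
}
```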

Please help me. Thanks.

Required imports:

// Spark 2.x (use SQLContext instead in Spark 1.x)
val spark: SparkSession = ???

import org.apache.spark.sql.functions.struct
import spark.implicits._
// import sqlContext.implicits._ // Spark 1.x

A simple select:

df.select($"d1", $"d2", $"d3", struct($"m1", $"m2", $"m3").alias("metric"))

followed by (Spark 2.x):

.as[Dimension] 

if you want a statically typed Dataset[Dimension] instead of a DataFrame.
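Putting the answer together end to end, a minimal sketch (Spark 2.x; class and column names taken from the question, sample data invented; run it via `:paste` in spark-shell or wrap it in an application object):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

case class Metric(m1: String, m2: String, m3: String)
case class Dimension(d1: String, d2: String, d3: String, metric: Metric)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the six-column DataFrame from the question
val df = Seq(("a", "b", "c", "1", "2", "3")).toDF("d1", "d2", "d3", "m1", "m2", "m3")

// No map over rows needed: struct groups the metric columns in one select,
// and .as[Dimension] gives a statically typed Dataset
val ds = df
  .select($"d1", $"d2", $"d3", struct($"m1", $"m2", $"m3").alias("metric"))
  .as[Dimension]

ds.printSchema()

// Nested fields remain addressable with dot syntax
ds.select($"metric.m1").show()
```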
