How to convert columns to rows in Spark Scala or Spark SQL?
I have data like this:
+------+------+------+----------+----------+----------+----------+----------+----------+
| Col1 | Col2 | Col3 | Col1_cnt | Col2_cnt | Col3_cnt | Col1_wts | Col2_wts | Col3_wts |
+------+------+------+----------+----------+----------+----------+----------+----------+
| AAA | VVVV | SSSS | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| BBB | BBBB | TTTT | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| CCC | DDDD | YYYY | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
+------+------+------+----------+----------+----------+----------+----------+----------+
I tried the following, but it did not get me anywhere:
val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
I want the output in the form of the table below:
+-----------+---------+---------+
| Cols_name | Col_cnt | Col_wts |
+-----------+---------+---------+
| Col1 | 3 | 0.5 |
| Col2 | 4 | 0.4 |
| Col3 | 5 | 0.6 |
+-----------+---------+---------+
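Here is one general approach for transposing a DataFrame:
1. For each of the pivot columns (say c1, c2, c3), combine the column name and its associated value columns into a struct (e.g. struct(lit("c1"), c1_cnt, c1_wts)).
2. Put all these struct-typed columns into an array, which is then explode-ed into rows of struct columns.
3. Group by the pivot column name to aggregate the associated struct elements.
The following sample code has been generalized to handle an arbitrary list of columns to be transposed: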
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6)
).toDF("c1", "c2", "c3", "c1_cnt", "c2_cnt", "c3_cnt", "c1_wts", "c2_wts", "c3_wts")
// Columns to transpose, and the suffixes of their associated value columns
val pivotCols = Seq("c1", "c2", "c3")
val valueColSfx = Seq("_cnt", "_wts")
// One struct per pivot column: (column name, count, weight)
val arrStructs = pivotCols.map { c =>
  struct(
    Seq(lit(c).as("_pvt")) ++ valueColSfx.map(s => col(c + s).as(s)): _*
  ).as(c + "_struct")
}
// One first() aggregate per value suffix
val valueColAgg = valueColSfx.map(s => first($"struct_col.$s").as(s + "_first"))
df.
  select(array(arrStructs: _*).as("arr_structs")).
  withColumn("struct_col", explode($"arr_structs")).
  groupBy($"struct_col._pvt").agg(valueColAgg.head, valueColAgg.tail: _*).
  show
// +----+----------+----------+
// |_pvt|_cnt_first|_wts_first|
// +----+----------+----------+
// | c1| 3| 0.5|
// | c3| 5| 0.6|
// | c2| 4| 0.4|
// +----+----------+----------+
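To see what the array-of-structs, explode, and groupBy steps do to the data, the same reshaping can be traced on plain Scala collections, without Spark. This is only an illustrative sketch: TransposeSketch, WideRow, LongRow, explodeRows, and transpose are hypothetical names made up for this example, and the wide row is simplified to hold only the numeric value columns (with counts stored as Double).

```scala
object TransposeSketch {
  // A wide row holds only the numeric value columns, keyed by column name
  // (a simplified, hypothetical stand-in for a DataFrame row).
  type WideRow = Map[String, Double]

  // A long row: the pivot column name plus its values keyed by suffix.
  final case class LongRow(pvt: String, values: Map[String, Double])

  val pivotCols = Seq("c1", "c2", "c3")
  val valueColSfx = Seq("_cnt", "_wts")

  // Steps 1 and 2: build one LongRow per pivot column (the "array of structs"),
  // then flatten across all input rows (the "explode").
  def explodeRows(rows: Seq[WideRow]): Seq[LongRow] =
    rows.flatMap { row =>
      pivotCols.map { c =>
        LongRow(c, valueColSfx.map(s => (s, row(c + s))).toMap)
      }
    }

  // Step 3: group by pivot column name and keep the first row seen per group,
  // playing the role of first(). The pivotCols order is reused for the output,
  // so unlike a Spark groupBy the row order here is deterministic.
  def transpose(rows: Seq[WideRow]): Seq[LongRow] = {
    val grouped = explodeRows(rows).groupBy(_.pvt)
    pivotCols.map(c => grouped(c).head)
  }

  // Same numbers as the sample DataFrame: three identical rows.
  val sample: Seq[WideRow] = Seq.fill(3)(Map(
    "c1_cnt" -> 3.0, "c2_cnt" -> 4.0, "c3_cnt" -> 5.0,
    "c1_wts" -> 0.5, "c2_wts" -> 0.4, "c3_wts" -> 0.6
  ))
}
```

Calling TransposeSketch.transpose(TransposeSketch.sample) gives one LongRow per pivot column, which corresponds to the three-row result shown above; the intermediate explodeRows step produces nine rows (three input rows times three pivot columns).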
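Note that the function first is used in the example above, but it could be any other aggregate function (e.g. avg, max, collect_list), depending on the specific business requirements.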
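As a rough sketch of swapping the aggregate, the snippet below reuses the same struct/array/explode pattern but takes max of the counts and avg of the weights instead of first. The AggSwapSketch object, the transposed method, the local SparkSession setup, and the sample values (changed so that the rows differ and the choice of aggregate matters) are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object AggSwapSketch {
  def transposed(): DataFrame = {
    // Assumption: a local session for experimentation; getOrCreate() reuses
    // an existing session (e.g. the one provided by spark-shell) if present.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("agg-swap-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample values differ per row so that max and avg give visible results.
    val df = Seq(
      ("AAA", "VVVV", 3, 4, 0.5, 0.4),
      ("BBB", "BBBB", 5, 8, 0.7, 0.2)
    ).toDF("c1", "c2", "c1_cnt", "c2_cnt", "c1_wts", "c2_wts")

    val pivotCols = Seq("c1", "c2")

    // One struct per pivot column: (column name, count, weight).
    val structs = pivotCols.map { c =>
      struct(
        lit(c).as("_pvt"),
        col(c + "_cnt").as("_cnt"),
        col(c + "_wts").as("_wts")
      )
    }

    df.select(explode(array(structs: _*)).as("struct_col"))
      .groupBy($"struct_col._pvt")
      .agg(
        max($"struct_col._cnt").as("_cnt_max"), // replaces first() for counts
        avg($"struct_col._wts").as("_wts_avg")  // replaces first() for weights
      )
  }
}
```

With this sample data, AggSwapSketch.transposed() has one row per pivot column: the count column holds the largest count seen for that column, and the weight column holds the mean of its weights. Note that first gives no ordering guarantee after a shuffle, so an order-independent aggregate such as max or avg may be the safer choice when rows differ.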