
How to convert columns to rows in Spark Scala or Spark SQL?

I have data like this:

+------+------+------+----------+----------+----------+----------+----------+----------+
| Col1 | Col2 | Col3 | Col1_cnt | Col2_cnt | Col3_cnt | Col1_wts | Col2_wts | Col3_wts |
+------+------+------+----------+----------+----------+----------+----------+----------+
| AAA  | VVVV | SSSS |        3 |        4 |        5 |      0.5 |      0.4 |      0.6 |
| BBB  | BBBB | TTTT |        3 |        4 |        5 |      0.5 |      0.4 |      0.6 |
| CCC  | DDDD | YYYY |        3 |        4 |        5 |      0.5 |      0.4 |      0.6 |
+------+------+------+----------+----------+----------+----------+----------+----------+

I have tried the following, but it is not getting me anywhere:

val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")

I want the output in the form of the table below:

+-----------+---------+---------+
| Cols_name | Col_cnt | Col_wts |
+-----------+---------+---------+
| Col1      |       3 |     0.5 |
| Col2      |       4 |     0.4 |
| Col3      |       5 |     0.6 |
+-----------+---------+---------+

Here's a general approach for transposing a DataFrame:

  1. For each of the pivot columns (say c1, c2, c3), combine the column name and its associated value columns into a struct (e.g. struct(lit(c1), c1_cnt, c1_wts))
  2. Put all of these struct-typed columns into an array, which is then explode-ed into rows of struct columns
  3. Group by the pivot column name to aggregate the associated struct elements

The following sample code has been generalized to handle an arbitrary list of columns to be transposed:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
  ("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
  ("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6)
).toDF("c1", "c2", "c3", "c1_cnt", "c2_cnt", "c3_cnt", "c1_wts", "c2_wts", "c3_wts")

// Columns to be transposed and the suffixes of their associated value columns
val pivotCols = Seq("c1", "c2", "c3")
val valueColSfx = Seq("_cnt", "_wts")

// Step 1: for each pivot column, build a struct holding the column name and its value columns
val arrStructs = pivotCols.map{ c => struct(
    Seq(lit(c).as("_pvt")) ++
      valueColSfx.map((c, _)).map{ case (p, s) => col(p + s).as(s) }: _*
  ).as(c + "_struct")
}

// Aggregate expressions used in step 3: one per value-column suffix (here using `first`)
val valueColAgg = valueColSfx.map(s => first($"struct_col.$s").as(s + "_first"))

// Steps 2 and 3: pack the structs into an array, explode into one row per struct,
// then group by the pivot column name and aggregate
df.
  select(array(arrStructs: _*).as("arr_structs")).
  withColumn("struct_col", explode($"arr_structs")).
  groupBy($"struct_col._pvt").agg(valueColAgg.head, valueColAgg.tail: _*).
  show
// +----+----------+----------+
// |_pvt|_cnt_first|_wts_first|
// +----+----------+----------+
// |  c1|         3|       0.5|
// |  c3|         5|       0.6|
// |  c2|         4|       0.4|
// +----+----------+----------+
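
The same three steps can also be written with the column names hard-coded for this specific dataset (a sketch only, reusing the df defined above); it yields the Cols_name / Col_cnt / Col_wts layout requested in the question, though the row order is not guaranteed:

val hardCoded = df.
  select(array(                                                                             // step 2: array of the per-column structs
    struct(lit("Col1").as("Cols_name"), $"c1_cnt".as("Col_cnt"), $"c1_wts".as("Col_wts")),  // step 1: name + value columns per struct
    struct(lit("Col2").as("Cols_name"), $"c2_cnt".as("Col_cnt"), $"c2_wts".as("Col_wts")),
    struct(lit("Col3").as("Cols_name"), $"c3_cnt".as("Col_cnt"), $"c3_wts".as("Col_wts"))
  ).as("arr_structs")).
  withColumn("struct_col", explode($"arr_structs")).                                        // step 2: explode into rows
  groupBy($"struct_col.Cols_name").                                                         // step 3: group by column name
  agg(first($"struct_col.Col_cnt").as("Col_cnt"), first($"struct_col.Col_wts").as("Col_wts"))

hardCoded.show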

Note that the function first is used in the above example, but it could be any other aggregate function (e.g. avg, max, collect_list) depending on the specific business requirement.
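
For instance, a minimal variation of the aggregation step (a sketch; only the aggregate expressions change) that keeps every value per pivot column with collect_list instead of just the first one:

// Same pipeline as above, but collecting all values per pivot column
val valueColAggAll = valueColSfx.map(s => collect_list($"struct_col.$s").as(s + "_all"))

df.
  select(array(arrStructs: _*).as("arr_structs")).
  withColumn("struct_col", explode($"arr_structs")).
  groupBy($"struct_col._pvt").agg(valueColAggAll.head, valueColAggAll.tail: _*).
  show(false)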
