
Merge Spark Dataframe rows based on key

Using pivot in PySpark I was able to get the values below. Note that columns T1..T4 are generated dynamically from the pivot output, so I cannot predict whether there will be more or fewer of them.

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|   0.000000|             0.0|        0E-10|23.1500000000|
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|        0E-10|
+--------------------+-----------+----------------+-------------+-------------+
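For context, a minimal sketch of the kind of pivot that yields such dynamically named columns; the long-format input, the extra grouping column `src`, and all values are assumptions, not the actual data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical long-format input; `src` is an assumed second grouping column,
# which is why the pivot emits two rows for the same ID
long_df = spark.createDataFrame(
    [("15964021641455171213", "a", "T4", 23.15),
     ("15964021641455171213", "b", "T1", 55.56),
     ("15964021641455171213", "b", "T2", 40.7440000000002),
     ("15964021641455171213", "b", "T3", 18.52)],
    ["ID", "src", "type", "value"],
)

# pivot() turns the distinct values of `type` into columns T1..T4, so the set
# of pivoted columns is only known at runtime (this sketch produces nulls
# where the question's decimal data shows 0E-10 / 0.0)
pivoted = long_df.groupBy("ID", "src").pivot("type").agg(F.sum("value")).drop("src")
pivoted.show()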

Expected Result:

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|23.1500000000|
+--------------------+-----------+----------------+-------------+-------------+

Any help is appreciated!

The operation is a simple groupBy with sum as the aggregation function. The main issue is that the names and number of columns to be summed are unknown, so the aggregation columns have to be computed dynamically:

from pyspark.sql import functions as F

df = ...

# Every column except the key takes part in the aggregation
non_id_cols = df.columns
non_id_cols.remove('ID')

# Build one sum() expression per remaining column, keeping the original name
summed_non_id_cols = [F.sum(c).alias(c) for c in non_id_cols]

df.groupBy('ID').agg(*summed_non_id_cols).show()
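
A minimal end-to-end sketch of the same approach, under the assumption that the pivoted frame looks like the one in the question (sample values typed as plain doubles rather than the original decimals):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the pivoted output shown in the question
df = spark.createDataFrame(
    [("15964021641455171213", 0.0, 0.0, 0.0, 23.15),
     ("15964021641455171213", 55.56, 40.7440000000002, 18.52, 0.0)],
    ["ID", "T1", "T2", "T3", "T4"],
)

# Sum every non-key column, keeping the original column names
non_id_cols = [c for c in df.columns if c != 'ID']
df.groupBy('ID').agg(*[F.sum(c).alias(c) for c in non_id_cols]).show()
# One merged row per ID:
# 15964021641455171213 | 55.56 | 40.7440000000002 | 18.52 | 23.15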
