
Merge Spark Dataframe rows based on key

Using pivot in PySpark I was able to get the values below. Note that columns T1..T4 are generated dynamically from the pivot output, so I cannot predict whether there will be more or fewer of them.

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|   0.000000|             0.0|        0E-10|23.1500000000|
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|        0E-10|
+--------------------+-----------+----------------+-------------+-------------+
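For context, a minimal sketch of the kind of pivot that yields such dynamically named columns; the long-format input, the extra grouping column `src`, and all values are assumptions, not the actual data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical long-format input; `src` is an assumed second grouping column,
# which is why the pivot emits two rows for the same ID
long_df = spark.createDataFrame(
    [("15964021641455171213", "a", "T4", 23.15),
     ("15964021641455171213", "b", "T1", 55.56),
     ("15964021641455171213", "b", "T2", 40.7440000000002),
     ("15964021641455171213", "b", "T3", 18.52)],
    ["ID", "src", "type", "value"],
)

# pivot() turns the distinct values of `type` into columns T1..T4, so the set
# of pivoted columns is only known at runtime (this sketch produces nulls
# where the question's decimal data shows 0E-10 / 0.0)
pivoted = long_df.groupBy("ID", "src").pivot("type").agg(F.sum("value")).drop("src")
pivoted.show()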

Expected Result:

+--------------------+-----------+----------------+-------------+-------------+
|   ID               |T1         |          T2    | T3          |        T4   |
+--------------------+-----------+----------------+-------------+-------------+
|15964021641455171213|  55.560000|40.7440000000002|18.5200000000|23.1500000000|
+--------------------+-----------+----------------+-------------+-------------+

Any help is appreciated!

The operation is a simple groupBy with sum as the aggregation function. The main issue is that the names and number of columns to be summed are unknown, so the aggregation columns have to be computed dynamically:

from pyspark.sql import functions as F

df = ...

# Every column except the key takes part in the aggregation
non_id_cols = df.columns
non_id_cols.remove('ID')

# Build one sum() expression per remaining column, keeping the original name
summed_non_id_cols = [F.sum(c).alias(c) for c in non_id_cols]

df.groupBy('ID').agg(*summed_non_id_cols).show()
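
A minimal end-to-end sketch of the same approach, under the assumption that the pivoted frame looks like the one in the question (sample values typed as plain doubles rather than the original decimals):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the pivoted output shown in the question
df = spark.createDataFrame(
    [("15964021641455171213", 0.0, 0.0, 0.0, 23.15),
     ("15964021641455171213", 55.56, 40.7440000000002, 18.52, 0.0)],
    ["ID", "T1", "T2", "T3", "T4"],
)

# Sum every non-key column, keeping the original column names
non_id_cols = [c for c in df.columns if c != 'ID']
df.groupBy('ID').agg(*[F.sum(c).alias(c) for c in non_id_cols]).show()
# One merged row per ID:
# 15964021641455171213 | 55.56 | 40.7440000000002 | 18.52 | 23.15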
