How to sum the same value per group by field in Pyspark

I have a dataframe called "df" that is grouped by the "accountname" field. Each entry in this column has a cost that can be the same or different, and I need to sum the cost only when it is different. This is the original df:

accountname |   namespace   |   cost    
account001  |   ns1         |   11      
account001  |   ns1         |   11      
account001  |   ns1         |   11      
account001  |   ns1         |   11      
account001  |   ns2         |   10      
account001  |   ns2         |   10      
account002  |   ns3         |   50      
account002  |   ns3         |   50      
account002  |   ns3         |   50      
account003  |   ns4         |   5    

The only account in the "accountname" field that has different costs is "account001", so I only need to add 11 + 10 once. I need to get something like this:

accountname |   namespace   |   cost    |   cost_to_pay
account001  |   ns1         |   11      |   21
account001  |   ns1         |   11      |   21
account001  |   ns1         |   11      |   21
account001  |   ns1         |   11      |   21
account001  |   ns2         |   10      |   21
account001  |   ns2         |   10      |   21
account002  |   ns3         |   50      |   50
account002  |   ns3         |   50      |   50
account002  |   ns3         |   50      |   50
account003  |   ns4         |   5       |   5

Any idea how to do it? Thanks in advance.
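
For reference, a minimal sketch that recreates this df locally (the SparkSession setup is an assumption and not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("account001", "ns1", 11), ("account001", "ns1", 11),
    ("account001", "ns1", 11), ("account001", "ns1", 11),
    ("account001", "ns2", 10), ("account001", "ns2", 10),
    ("account002", "ns3", 50), ("account002", "ns3", 50),
    ("account002", "ns3", 50), ("account003", "ns4", 5),
]
df = spark.createDataFrame(data, ["accountname", "namespace", "cost"])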

You can use collect_set over a window partitioned by accountname to get the distinct cost values, then sum the elements of the resulting array using the aggregate function:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "cost_to_pay",
    F.expr("aggregate(collect_set(cost) over(partition by accountname), 0D, (acc, x) -> acc + x)")
)

df1.show()
#+-----------+---------+----+-----------+
#|accountname|namespace|cost|cost_to_pay|
#+-----------+---------+----+-----------+
#| account003|      ns4|   5|          5|
#| account001|      ns1|  11|         21|
#| account001|      ns1|  11|         21|
#| account001|      ns1|  11|         21|
#| account001|      ns1|  11|         21|
#| account001|      ns2|  10|         21|
#| account001|      ns2|  10|         21|
#| account002|      ns3|  50|         50|
#| account002|      ns3|  50|         50|
#| account002|      ns3|  50|         50|
#+-----------+---------+----+-----------+
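
If you prefer to stay in the DataFrame API instead of F.expr, the same logic can be written with F.aggregate and a window spec. This is a sketch assuming Spark 3.1+, where F.aggregate and Python lambdas for higher-order functions are available:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("accountname")

df1 = df.withColumn(
    "cost_to_pay",
    F.aggregate(
        F.collect_set("cost").over(w),  # distinct cost values per accountname
        F.lit(0.0),                     # double accumulator, like 0D in the SQL expression
        lambda acc, x: acc + x,         # fold: add each distinct cost
    ),
)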

You can remove the duplicates with dropDuplicates, group by accountname and sum the cost, then join back to the original dataframe on accountname:

import pyspark.sql.functions as F

df2 = (df.dropDuplicates(['accountname', 'namespace', 'cost'])
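         # deduplicate so each distinct cost per account is counted only once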
         .groupBy('accountname')
         .agg(F.sum('cost').alias('cost_to_pay'))
         .join(df, 'accountname')
         .select('accountname', 'namespace', 'cost', 'cost_to_pay')
      )

df2.show()
+-----------+---------+----+-----------+
|accountname|namespace|cost|cost_to_pay|
+-----------+---------+----+-----------+
| account001|      ns1|  11|         21|
| account001|      ns1|  11|         21|
| account001|      ns1|  11|         21|
| account001|      ns1|  11|         21|
| account001|      ns2|  10|         21|
| account001|      ns2|  10|         21|
| account002|      ns3|  50|         50|
| account002|      ns3|  50|         50|
| account002|      ns3|  50|         50|
| account003|      ns4|   5|          5|
+-----------+---------+----+-----------+
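
Both answers produce the same rows for this data; the window-based version avoids the extra join. To eyeball the equivalence, sorting both outputs the same way makes them easier to compare (assuming df1 and df2 from the snippets above are both in scope):

df1.orderBy('accountname', 'namespace').show()
df2.orderBy('accountname', 'namespace').show()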
