How to sum the same value per group-by field in PySpark
I have a dataframe called "df" that is grouped by the "accountname" field. Each entry in this column has a cost that can be the same or different, and I need to sum a cost only once when it differs. This is the original df:
accountname | namespace | cost
account001 | ns1 | 11
account001 | ns1 | 11
account001 | ns1 | 11
account001 | ns1 | 11
account001 | ns2 | 10
account001 | ns2 | 10
account002 | ns3 | 50
account002 | ns3 | 50
account002 | ns3 | 50
account003 | ns4 | 5
The only account that has different costs within the "accountname" field is "account001", so I only need to add 11 + 10 once. I need to get something like this:
accountname | namespace | cost | cost_to_pay
account001 | ns1 | 11 | 21
account001 | ns1 | 11 | 21
account001 | ns1 | 11 | 21
account001 | ns1 | 11 | 21
account001 | ns2 | 10 | 21
account001 | ns2 | 10 | 21
account002 | ns3 | 50 | 50
account002 | ns3 | 50 | 50
account002 | ns3 | 50 | 50
account003 | ns4 | 5 | 5
Any idea how to do it? Thanks in advance.
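For reference, the sample dataframe can be recreated with something like this (a sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data shown above.
data = [
    ("account001", "ns1", 11), ("account001", "ns1", 11),
    ("account001", "ns1", 11), ("account001", "ns1", 11),
    ("account001", "ns2", 10), ("account001", "ns2", 10),
    ("account002", "ns3", 50), ("account002", "ns3", 50),
    ("account002", "ns3", 50), ("account003", "ns4", 5),
]
df = spark.createDataFrame(data, ["accountname", "namespace", "cost"])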
You can use collect_set over a window partitioned by accountname to get the distinct cost values, then sum the elements of the resulting array using the aggregate function:
from pyspark.sql import functions as F

# Collect the distinct cost values per accountname into an array,
# then fold the array with aggregate to sum its elements.
df1 = df.withColumn(
    "cost_to_pay",
    F.expr("aggregate(collect_set(cost) over(partition by accountname), 0D, (acc, x) -> acc + x)")
)
df1.show()
#+-----------+---------+----+-----------+
#|accountname|namespace|cost|cost_to_pay|
#+-----------+---------+----+-----------+
#| account003| ns4| 5| 5|
#| account001| ns1| 11| 21|
#| account001| ns1| 11| 21|
#| account001| ns1| 11| 21|
#| account001| ns1| 11| 21|
#| account001| ns2| 10| 21|
#| account001| ns2| 10| 21|
#| account002| ns3| 50| 50|
#| account002| ns3| 50| 50|
#| account002| ns3| 50| 50|
#+-----------+---------+----+-----------+
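If you are on Spark 3.1 or later, the same logic can be written with the DataFrame API instead of a SQL expression string (a sketch; F.aggregate was added in 3.1, so the availability of these functions is an assumption about your Spark version):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same approach via the DataFrame API: collect the distinct costs per
# accountname over a window, then fold the array into a sum.
w = Window.partitionBy("accountname")
df1 = df.withColumn(
    "cost_to_pay",
    F.aggregate(F.collect_set("cost").over(w), F.lit(0.0), lambda acc, x: acc + x)
)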
Alternatively, you can remove duplicate rows with dropDuplicates, group by accountname and sum the cost, then join the result back to the original dataframe on accountname:
import pyspark.sql.functions as F

# Deduplicate identical (accountname, namespace, cost) rows, sum the
# remaining distinct costs per account, then join the per-account total
# back onto the original rows.
df2 = (df.dropDuplicates(['accountname', 'namespace', 'cost'])
       .groupBy('accountname')
       .agg(F.sum('cost').alias('cost_to_pay'))
       .join(df, 'accountname')
       .select('accountname', 'namespace', 'cost', 'cost_to_pay')
      )
df2.show()
+-----------+---------+----+-----------+
|accountname|namespace|cost|cost_to_pay|
+-----------+---------+----+-----------+
| account001| ns1| 11| 21|
| account001| ns1| 11| 21|
| account001| ns1| 11| 21|
| account001| ns1| 11| 21|
| account001| ns2| 10| 21|
| account001| ns2| 10| 21|
| account002| ns3| 50| 50|
| account002| ns3| 50| 50|
| account002| ns3| 50| 50|
| account003| ns4| 5| 5|
+-----------+---------+----+-----------+
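As a quick sanity check (a sketch), each accountname should collapse to a single cost_to_pay equal to the sum of its distinct cost values:

df2.select('accountname', 'cost_to_pay').distinct().show()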