Create a new calculated column on groupby in PySpark
I have the following dataframe in PySpark, which is already grouped by the column "accountname".
accountname | namespace | cost | cost_to_pay
account001 | ns1 | 93 | 9
account001 | Transversal | 93 | 25
account002 | ns2 | 50 | 27
account002 | Transversal | 50 | 12
I need a new column that holds "cost" - "cost_to_pay" from the row where "namespace" == "Transversal", and I need this result in every row of the new column, something like this:
accountname | namespace | cost | cost_to_pay | new_column1
account001 | ns1 | 93 | 9 | 68
account001 | Transversal | 93 | 25 | 68
account002 | ns2 | 50 | 27 | 38
account002 | Transversal | 50 | 12 | 38
68 is the result of 93 - 25 for the account001 group, and 38 is the result of 50 - 12 for account002.
Any idea how I can achieve this?
You can get the difference for each accountname using the maximum of a masked difference:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'new_column1',
    F.max(
        # when() without otherwise() is null for non-Transversal rows,
        # and max() skips nulls within each partition
        F.when(
            F.col('namespace') == 'Transversal',
            F.col('cost') - F.col('cost_to_pay')
        )
    ).over(Window.partitionBy('accountname'))
)

df2.show()
+-----------+-----------+----+-----------+-----------+
|accountname|  namespace|cost|cost_to_pay|new_column1|
+-----------+-----------+----+-----------+-----------+
| account001|        ns1|  93|          9|         68|
| account001|Transversal|  93|         25|         68|
| account002|        ns2|  50|         27|         38|
| account002|Transversal|  50|         12|         38|
+-----------+-----------+----+-----------+-----------+
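The expression works because F.when(...) without an otherwise() yields null for rows that do not match, and the max aggregate skips nulls, so each partition is left with its Transversal difference. The plain-Python sketch below only mimics that logic on the sample rows, without Spark; None stands in for null, and the helper name masked_max_per_group is made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question.
rows = [
    {"accountname": "account001", "namespace": "ns1", "cost": 93, "cost_to_pay": 9},
    {"accountname": "account001", "namespace": "Transversal", "cost": 93, "cost_to_pay": 25},
    {"accountname": "account002", "namespace": "ns2", "cost": 50, "cost_to_pay": 27},
    {"accountname": "account002", "namespace": "Transversal", "cost": 50, "cost_to_pay": 12},
]

def masked_max_per_group(rows):
    """Hypothetical helper: per-account max of cost - cost_to_pay over Transversal rows."""
    candidates = defaultdict(list)
    for r in rows:
        # Like F.when(...) with no otherwise(): non-matching rows give None (null).
        masked = r["cost"] - r["cost_to_pay"] if r["namespace"] == "Transversal" else None
        candidates[r["accountname"]].append(masked)
    # Like F.max over the partition: nulls are skipped; an all-null group stays None.
    return {
        acc: max((v for v in vals if v is not None), default=None)
        for acc, vals in candidates.items()
    }

group_value = masked_max_per_group(rows)
# Put the per-group value back on every row, as the window function does.
result = [dict(r, new_column1=group_value[r["accountname"]]) for r in rows]
print([r["new_column1"] for r in result])  # [68, 68, 38, 38]
```

By the same logic, an account with several Transversal rows would get the largest of their differences, and an account with none would get null.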
If df is your dataframe after the groupby, you can build a df_temp using:
from pyspark.sql import functions as F

df_temp = df.filter(F.col('namespace') == 'Transversal')
df_temp = df_temp.withColumn('new_column1', F.col('cost') - F.col('cost_to_pay'))
df_temp = df_temp.select('accountname', 'new_column1')  # keep only relevant columns
# you might want some extra checks here, like dropping duplicates, etc.
# finally, join df_temp with your main dataframe df
df = df.join(df_temp, on='accountname', how='left')
df = df.na.fill({'new_column1': 0})  # if you wish to fill nulls with some predefined value, like 0
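The note about duplicates matters because a left join repeats each row of df once per matching row in df_temp. The plain-Python sketch below illustrates this with made-up data that is not from the question: account003 has a duplicated Transversal row and account004 has none. The left_join function is a hypothetical stand-in for the Spark join, not Spark itself.

```python
# Hypothetical rows: (accountname, namespace, cost, cost_to_pay)
rows = [
    ("account003", "ns3", 40, 5),
    ("account003", "Transversal", 40, 10),
    ("account003", "Transversal", 40, 10),  # duplicated Transversal row
    ("account004", "ns4", 70, 7),           # no Transversal row at all
]

# Equivalent of the filter + withColumn + select steps that build df_temp.
df_temp = [(acc, cost - pay) for acc, ns, cost, pay in rows if ns == "Transversal"]

def left_join(left, right):
    """Hypothetical stand-in for df.join(df_temp, on='accountname', how='left')."""
    out = []
    for row in left:
        matches = [val for acc, val in right if acc == row[0]]
        # With no match, the left row is kept with a null (None) in the new column.
        for val in matches or [None]:
            out.append(row + (val,))
    return out

joined_raw = left_join(rows, df_temp)
# Dropping duplicates in df_temp first keeps one output row per input row.
joined_dedup = left_join(rows, sorted(set(df_temp)))
# Equivalent of na.fill on the new column: replace nulls with 0.
filled = [r[:-1] + (0 if r[-1] is None else r[-1],) for r in joined_dedup]

print(len(rows), len(joined_raw), len(joined_dedup))  # 4 7 4
```

Without deduplication the four input rows become seven. After deduplication the row count is preserved, and account004 ends up with 0 in the new column once nulls are filled.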