Create a new calculated column on a groupby in PySpark

Question

I have the following dataframe in PySpark, which is already grouped by the column "accountname".

accountname |   namespace   |   cost    |   cost_to_pay
account001  |   ns1         |   93      |   9
account001  |   Transversal |   93      |   25
account002  |   ns2         |   50      |   27
account002  |   Transversal |   50      |   12

I need a new column that is "cost" - "cost_to_pay" where "namespace" == "Transversal", and I need this result in every row of the new column, something like this:

accountname |   namespace   |   cost    |   cost_to_pay |   new_column1                                         
account001  |   ns1         |   93      |   9           |   68                    
account001  |   Transversal |   93      |   25          |   68
account002  |   ns2         |   50      |   27          |   38
account002  |   Transversal |   50      |   12          |   38

68 is the result of 93 - 25 for the account001 group, and 38 is the result of 50 - 12 for account002.

Any idea how I can achieve this?

Answer 1

You can get the difference for each accountname using the maximum of a masked difference:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'new_column1',
    F.max(
        F.when(
            F.col('namespace') == 'Transversal',
            F.col('cost') - F.col('cost_to_pay')
        )
    ).over(Window.partitionBy('accountname'))
)

df2.show()
+-----------+-----------+----+-----------+-----------+
|accountname|  namespace|cost|cost_to_pay|new_column1|
+-----------+-----------+----+-----------+-----------+
| account001|        ns1|  93|          9|         68|
| account001|Transversal|  93|         25|         68|
| account002|        ns2|  50|         27|         38|
| account002|Transversal|  50|         12|         38|
+-----------+-----------+----+-----------+-----------+
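
Why the maximum works here: `F.when` without an `otherwise` yields null on rows that do not match, and `max` skips nulls, so the maximum over each `accountname` partition is the single Transversal difference. The plain-Python sketch below models that logic on the sample rows. It is only an illustration, not Spark code; the `rows` list and the `masked_max_per_account` helper are made up for this example.

```python
from collections import defaultdict

# Sample rows: (accountname, namespace, cost, cost_to_pay)
rows = [
    ("account001", "ns1", 93, 9),
    ("account001", "Transversal", 93, 25),
    ("account002", "ns2", 50, 27),
    ("account002", "Transversal", 50, 12),
]

def masked_max_per_account(rows):
    """Model of max(when(namespace == 'Transversal', cost - cost_to_pay)) per account."""
    groups = defaultdict(list)
    for account, namespace, cost, cost_to_pay in rows:
        # Like F.when without otherwise: None when the condition is false.
        masked = cost - cost_to_pay if namespace == "Transversal" else None
        groups[account].append(masked)
    # Like F.max: None values are ignored; an all-None group gives None.
    return {
        account: max((v for v in values if v is not None), default=None)
        for account, values in groups.items()
    }

per_account = masked_max_per_account(rows)
# Broadcast the per-account value back onto every row, as the window does.
result = [row + (per_account[row[0]],) for row in rows]
for row in result:
    print(row)
```

Under this model, an account with several Transversal rows would keep only the largest difference, and an account with none would get a null in the new column.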

Answer 2

If df is your dataframe after the groupby, you can build a df_temp using:

from pyspark.sql import functions as F

# compute the difference on the Transversal rows only
df_temp = df.filter(F.col('namespace') == 'Transversal')
df_temp = df_temp.withColumn('new_column1', F.col('cost') - F.col('cost_to_pay'))
df_temp = df_temp.select('accountname', 'new_column1')  # keep only relevant columns
# you might want to have some extra checks, like dropping duplicates, etc.

# and finally join df_temp with your main dataframe df
df = df.join(df_temp, on='accountname', how='left')
df = df.na.fill({'new_column1': 0})  # if you wish to fill nulls with a predefined value, like 0
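
On the "extra checks" comment above: a left join emits one output row per matching pair, so if an account had two Transversal rows, every row of that account would appear twice after the join. Here is a plain-Python model of that behavior, using made-up data and a hypothetical `left_join` helper (not Spark code):

```python
def left_join(left, right, key=0):
    """Model of a left join on one key column: one output row per matching pair."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if not matches:
            out.append(l + (None,))  # no match: null in the new column
        for r in matches:
            out.append(l + (r[1],))
    return out

main = [("account001", "ns1"), ("account001", "Transversal")]

# Hypothetical case: account001 has two Transversal rows, so the filtered
# frame holds two (accountname, new_column1) rows.
temp_with_dupes = [("account001", 68), ("account001", 68)]
temp_deduped = sorted(set(temp_with_dupes))

print(len(left_join(main, temp_with_dupes)))  # 4
print(len(left_join(main, temp_deduped)))     # 2
```

In PySpark, calling `dropDuplicates()` on `df_temp` before the join should cover the exact-duplicate case; differing Transversal values for the same account would need an explicit aggregation rule instead.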

2 answers

Answer 1: score 2, accepted, 2021-02-25 15:13:49

Answer 2: score 1, 2021-02-25 15:14:16
