PySpark conditional cumulative sum

Question

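I have a PySpark DataFrame with two date columns: the bill date and the payment date. I want to create a column containing the sum of the amounts of all bills that were both billed and paid before that row's bill date. This has to be computed separately for each buyer ID. Example: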

Buyer	Bill date	Payment Date	Amount	New column
1	2021-02-02	2021-02-20	100	0
1	2021-03-02	2021-03-10	400	100
1	2021-04-02	2021-05-25	500	500
1	2021-05-02	2021-06-03	300	500
1	2021-06-02	2021-07-20	200	1000
2	2021-04-10	2021-05-25	1000	0
2	2021-05-11	2021-06-03	3000	0
2	2021-06-15	2021-07-20	2000	4000

The pandas equivalent of what I am looking for is:

def to_value(row):
    # Same buyer, billed before this row's bill date, and paid before this row's bill date.
    return dt[(dt['bill_dt']<row['bill_dt'])&(dt['pay_dt']<row['bill_dt'])&(dt['buyer_id']==row['buyer_id'])].amount.sum()

dt['new_col']=dt.apply(to_value,axis=1)
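
For a reproducible check, the sketch below rebuilds the example table as a pandas DataFrame and applies a row-wise filter of the same shape. The column names `buyer_id`, `bill_dt`, `pay_dt`, and `amount` follow the snippet above. The filter on `bill_dt` reflects the prose description ("billed and paid before that row's bill date"), which is an assumption about the intended logic.

```python
import pandas as pd

# Sample data copied from the table in the question.
dt = pd.DataFrame(
    {
        "buyer_id": [1, 1, 1, 1, 1, 2, 2, 2],
        "bill_dt": pd.to_datetime(
            ["2021-02-02", "2021-03-02", "2021-04-02", "2021-05-02",
             "2021-06-02", "2021-04-10", "2021-05-11", "2021-06-15"]
        ),
        "pay_dt": pd.to_datetime(
            ["2021-02-20", "2021-03-10", "2021-05-25", "2021-06-03",
             "2021-07-20", "2021-05-25", "2021-06-03", "2021-07-20"]
        ),
        "amount": [100, 400, 500, 300, 200, 1000, 3000, 2000],
    }
)


def to_value(row):
    # Bills of the same buyer that were billed and paid before this row's bill date.
    mask = (
        (dt["buyer_id"] == row["buyer_id"])
        & (dt["bill_dt"] < row["bill_dt"])
        & (dt["pay_dt"] < row["bill_dt"])
    )
    return dt.loc[mask, "amount"].sum()


dt["new_col"] = dt.apply(to_value, axis=1)
print(dt["new_col"].tolist())  # [0, 100, 500, 500, 1000, 0, 0, 4000]
```

This scans the whole DataFrame once per row, so it is only practical for small data. That is the reason for wanting a Spark version.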

Answer 1

You can use pandas_udf() and do the conditional handling inside it:

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window
from pyspark.sql.types import IntegerType

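def conditional_sum(data: pd.Series) -> int:
    # Each element of `data` is the struct of one row in the window frame.
    frame = data.apply(pd.Series)  # transform dict into separate columns
    # The frame ends at the current row, so the latest bill date is the current row's bill date.
    current_bill_date = frame['Bill date'].max()
    # Sum the amounts of the bills in the frame that were paid before that date.
    return frame.loc[frame['Payment Date'] < current_bill_date, 'Amount'].sum()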

udf_conditional_sum = F.pandas_udf(conditional_sum, IntegerType())

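# Frame: all rows of the same buyer, ordered by bill date, from the first row up to the current one.
w = Window.partitionBy("Buyer").orderBy("Bill date").rowsBetween(Window.unboundedPreceding, Window.currentRow)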

(
    df
    .withColumn("Conditional sum", udf_conditional_sum(F.struct("Bill date", "Payment Date", "Amount")).over(w))
    .show(truncate=False)
)
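
To see why a frame running from the first row of the partition to the current row is sufficient, here is a plain-Python sketch of what the window and the UDF compute together. It does not use Spark. The `rows` list and the function name `conditional_cumsum` are illustrative and not part of the answer.

```python
from collections import defaultdict
from datetime import date

# (buyer, bill date, payment date, amount), copied from the table in the question.
rows = [
    (1, date(2021, 2, 2), date(2021, 2, 20), 100),
    (1, date(2021, 3, 2), date(2021, 3, 10), 400),
    (1, date(2021, 4, 2), date(2021, 5, 25), 500),
    (1, date(2021, 5, 2), date(2021, 6, 3), 300),
    (1, date(2021, 6, 2), date(2021, 7, 20), 200),
    (2, date(2021, 4, 10), date(2021, 5, 25), 1000),
    (2, date(2021, 5, 11), date(2021, 6, 3), 3000),
    (2, date(2021, 6, 15), date(2021, 7, 20), 2000),
]


def conditional_cumsum(rows):
    # partitionBy("Buyer"): group the rows per buyer.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)

    result = []
    for buyer, partition in partitions.items():
        # orderBy("Bill date")
        partition.sort(key=lambda r: r[1])
        for i in range(len(partition)):
            # rowsBetween(unboundedPreceding, currentRow)
            frame = partition[: i + 1]
            # The latest bill date in the frame belongs to the current row.
            current_bill_date = max(r[1] for r in frame)
            total = sum(r[3] for r in frame if r[2] < current_bill_date)
            result.append((buyer, current_bill_date, total))
    return result


for buyer, bill_date, total in conditional_cumsum(rows):
    print(buyer, bill_date, total)
```

The current row never contributes to its own total as long as a bill is paid on or after its bill date, so including it in the frame is harmless.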

Solution 1
0 2022-08-08 12:49:02
