Pyspark, writing a loop to create multiple new columns based on different conditions
Let's say I have a Pyspark DataFrame with the following columns:
user, score, country, risky/safe, payment_id
I made a list of thresholds: [10, 20, 30]
Now I want to make a new column for each threshold:
Both of them should be grouped by country.
The result should be something like this:
Country | % payments thresh 10 | % users thresh 10 | % payments thresh 20 ...
A
B
C
I was able to make it work with an external for loop, but I want it all in one dataframe.
from pyspark.sql import functions as F

thresholds = [10, 20, 30]
for thresh in thresholds:
    df = (df
          .select('country', 'risk/safe', 'user', 'score', 'payment')
          .where(F.col('risk/safe') == 'risk')
          .groupBy('country')
          .agg((F.sum(F.when(F.col('score') >= thresh, 1)) /
                F.count('country')).alias('% payments')))
Use a list comprehension within the agg().
from pyspark.sql import functions as func

pay_aggs = [
    (func.sum((func.col('score') >= thresh).cast('int')) / func.count('country'))
    .alias('% pay ' + str(thresh))
    for thresh in thresholds
]
user_aggs = [
    (func.countDistinct(func.when(func.col('score') >= thresh, func.col('user'))) /
     func.countDistinct('user'))
    .alias('% user ' + str(thresh))
    for thresh in thresholds
]
df. \
    select('country', 'risk/safe', 'user', 'score', 'payment'). \
    where(func.col('risk/safe') == 'risk'). \
    groupBy('country'). \
    agg(*pay_aggs, *user_aggs)
The pay_aggs list will generate the following aggregations (you can easily print the list):
# [Column<'(sum(CAST((score >= 10) AS INT)) / count(country)) AS `% pay 10`'>,
# Column<'(sum(CAST((score >= 20) AS INT)) / count(country)) AS `% pay 20`'>,
# Column<'(sum(CAST((score >= 30) AS INT)) / count(country)) AS `% pay 30`'>]
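If it helps to see what these aggregations compute, here is a minimal pure-Python sanity check of the same logic (no Spark session needed). The sample rows and helper function names are made up for illustration; "% pay" counts payments at or above the threshold over all payments per country, and "% user" counts distinct users with any qualifying score over all distinct users per country.

```python
# Hypothetical sample rows: (country, user, score), already filtered to
# 'risk' rows as the where() clause in the answer would do.
rows = [
    ("A", "u1", 15),
    ("A", "u1", 25),
    ("A", "u2", 5),
    ("B", "u3", 35),
]

def pct_payments(country, thresh):
    # Fraction of this country's payments whose score meets the threshold,
    # mirroring sum((score >= thresh).cast('int')) / count('country').
    sub = [r for r in rows if r[0] == country]
    return sum(r[2] >= thresh for r in sub) / len(sub)

def pct_users(country, thresh):
    # Fraction of this country's distinct users with at least one qualifying
    # score, mirroring countDistinct(when(...)) / countDistinct('user').
    sub = [r for r in rows if r[0] == country]
    users_above = {r[1] for r in sub if r[2] >= thresh}
    all_users = {r[1] for r in sub}
    return len(users_above) / len(all_users)

print(pct_payments("A", 10))  # 2 of 3 payments in A score >= 10
print(pct_users("A", 10))     # 1 of 2 users in A has a score >= 10 -> 0.5
```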