PySpark: writing a loop to create multiple new columns based on different conditions

Question

Let's say I have a PySpark DataFrame with the following columns:

user, score, country, risky/safe, payment_id

I made a list of thresholds: [10, 20, 30]

Now I want to make two new columns for each threshold:

% of risky payments with score above the threshold out of all payments (risky and safe)
% of risky distinct users with at least one score above the threshold out of all users (risky and safe)

Both of them should be grouped by country.
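To pin down the two metrics before involving Spark, here is a plain-Python sketch on made-up rows. The sample data, the `"risky"`/`"safe"` label values, and the function name `metrics_by_country` are all assumptions for illustration, and `>=` is used to match the comparison in the code further down, even though the text says "above".

```python
from collections import defaultdict

# Hypothetical sample rows: (user, score, country, risk_label, payment_id)
rows = [
    ("u1", 15, "A", "risky", 1),
    ("u1", 5, "A", "risky", 2),
    ("u2", 25, "A", "safe", 3),
    ("u3", 35, "A", "risky", 4),
    ("u4", 12, "B", "safe", 5),
    ("u5", 22, "B", "risky", 6),
]


def metrics_by_country(rows, thresh):
    """Return {country: (share of payments, share of users)} for one threshold."""
    by_country = defaultdict(list)
    for row in rows:
        by_country[row[2]].append(row)

    out = {}
    for country, group in by_country.items():
        # Numerator rows: risky payments whose score reaches the threshold.
        hits = [r for r in group if r[3] == "risky" and r[1] >= thresh]
        # Denominators cover ALL payments / ALL users in the country.
        pct_payments = len(hits) / len(group)
        pct_users = len({r[0] for r in hits}) / len({r[0] for r in group})
        out[country] = (pct_payments, pct_users)
    return out


print(metrics_by_country(rows, 10))
```

With threshold 10, country A has 2 qualifying risky payments out of 4 payments and 2 qualifying users out of 3 distinct users; the safe payment from u2 counts only in the denominators.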

The result should be something like this:

Country | % payments thresh 10 | % users thresh 10 | % payments thresh 20 ... 
A
B
C

I was able to make it work with an external for loop (one DataFrame per threshold), but I want all the columns in a single DataFrame.

from pyspark.sql import functions as F

thresholds = [10, 20, 30]

for thresh in thresholds:
    # Builds a separate result for each threshold
    result = (df
        .select('country', 'risk/safe', 'user', 'score')
        .where(F.col('risk/safe') == 'risk')
        .groupBy('country')
        .agg((F.sum(F.when(F.col('score') >= thresh, 1))
              / F.count('country')).alias('% payments')))

Answer 1

Build the aggregation expressions with list comprehensions, one expression per threshold, and unpack the lists inside agg().

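from pyspark.sql import functions as func

# One aggregation per threshold: share of payments with score >= thresh
pay_aggs = [(func.sum((func.col('score')>=thresh).cast('int'))/func.count('country')).alias('% pay '+str(thresh)) for thresh in thresholds]
# One aggregation per threshold: share of distinct users with at least one score >= thresh
user_aggs = [(func.countDistinct(func.when(func.col('score')>=thresh, func.col('user')))/func.countDistinct('user')).alias('% user '+str(thresh)) for thresh in thresholds]

# 'score' must be kept in the select, since the aggregations read it
df. \
    select('country', 'risk/safe', 'user', 'score'). \
    where(func.col('risk/safe') == 'risk'). \
    groupBy('country'). \
    agg(*pay_aggs, *user_aggs)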

The pay_aggs list holds the following aggregation expressions (you can print the list to check):

# [Column<'(sum(CAST((score >= 10) AS INT)) / count(country)) AS `% pay 10`'>,
#  Column<'(sum(CAST((score >= 20) AS INT)) / count(country)) AS `% pay 20`'>,
#  Column<'(sum(CAST((score >= 30) AS INT)) / count(country)) AS `% pay 30`'>]

solution1
1 2022-08-09 06:46:14