How to dynamically chain when conditions in PySpark?

Question

Context

A dataframe needs a category column whose value is determined by a fixed set of rules. The set of rules is becoming quite large.

Question

Is there a way to use a list of tuples (see the example below) to dynamically chain the when conditions, so that the result is the same as the hard-coded solution at the bottom?

# Potential list of rule definitions.
# Each rule is (group, exclusive upper bound for size, category).
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
    # and so on ...
]

Example

Here is a toy example for reproducibility. A dataframe consisting of groups, ids and sizes should have the column category added, which depends on the contents of the group and size columns. The list of rules is shown in the section above.

Input data

+-----+-----+----+
|group|   id|size|
+-----+-----+----+
|    A|45345|   5|
|    C|55345|   5|
|    A|35345|  10|
|    B|65345|   4|
+-----+-----+----+

Hard coded solution

from pyspark.sql import functions as F

df = df.withColumn(
    'category',
    F.when(
        (F.col('group') == 'A')
        & (F.col('size') < 8),
        F.lit('small')
    ).when(
        (F.col('group') == 'A')
        & (F.col('size') < 30),
        F.lit('large')
    ).when(
        (F.col('group') == 'B')
        & (F.col('size') < 5),
        F.lit('small')
    ).otherwise(
        F.lit('unknown')
    )
)

+-----+-----+----+--------+
|group|   id|size|category|
+-----+-----+----+--------+
|    A|45345|   5|   small|
|    C|55345|   5| unknown|
|    A|35345|  10|   large|
|    B|65345|   4|   small|
+-----+-----+----+--------+

[Edit 1] Added more complex conditions to explain why chaining is needed.

Answer 1

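A solution based on the dataframe API. It assumes two-element rules of the form (group, category), such as [('A', 'small'), ('B', 'large')]:

from pyspark.sql import functions as F

# Start the chain with the first rule, then append one when() per remaining rule.
cond = F.when(F.col('group') == category_rules[0][0], F.lit(category_rules[0][1]))
for c in category_rules[1:]:
    cond = cond.when(F.col('group') == c[0], F.lit(c[1]))
cond = cond.otherwise('unknown')

df.withColumn("category", cond).show()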

Answer 2

You can use string interpolation to create an expression such as:

CASE
   WHEN (group = 'A') THEN 'small'
   WHEN (group = 'B') THEN 'large'
   ELSE 'unknown'
END

You can then pass it to Spark's expr function:

from pyspark.sql.functions import expr

data = [('A', '45345'), ('C', '55345'), ('A', '35345'), ('B', '65345')]
df = spark.createDataFrame(data, ['group', 'id'])

category_rules = [('A', 'small'), ('B', 'large')]

# One WHEN clause per (group, category) rule
when_cases = [f"WHEN (group = '{r[0]}') THEN '{r[1]}'" for r in category_rules]

rules_expr = "CASE " + " ".join(when_cases) + " ELSE 'unknown' END"
# CASE WHEN (group = 'A') THEN 'small' WHEN (group = 'B') THEN 'large' ELSE 'unknown' END

df.withColumn('category', expr(rules_expr)).show()

# +-----+-----+--------+
# |group|   id|category|
# +-----+-----+--------+
# |    A|45345|   small|
# |    C|55345| unknown|
# |    A|35345|   small|
# |    B|65345|   large|
# +-----+-----+--------+
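The same string-building approach extends to the three-element rules from the question. Below is a sketch, assuming rules of the form (group, size upper bound, category) and values that come from trusted code rather than user input, since they are interpolated directly into SQL. The function name build_case_expr is hypothetical.

```python
def build_case_expr(rules, default="unknown"):
    """Build a SQL CASE expression from (group, max_size, category) rules.

    Hypothetical helper; values are interpolated as-is, so they must be trusted.
    """
    if not rules:
        # Without rules the expression is just the default string literal.
        return f"'{default}'"
    when_cases = [
        f"WHEN (group = '{group}' AND size < {max_size}) THEN '{category}'"
        for group, max_size, category in rules
    ]
    return "CASE " + " ".join(when_cases) + f" ELSE '{default}' END"


category_rules = [('A', 8, 'small'), ('A', 30, 'large'), ('B', 5, 'small')]
rules_expr = build_case_expr(category_rules)
# CASE WHEN (group = 'A' AND size < 8) THEN 'small'
#      WHEN (group = 'A' AND size < 30) THEN 'large'
#      WHEN (group = 'B' AND size < 5) THEN 'small' ELSE 'unknown' END
```

The resulting string can be passed to expr in the same way, for example df.withColumn('category', expr(rules_expr)), provided the dataframe has a size column.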

Answer 3

I hope this solution fits your needs:

Create a new dataframe from the list of tuples you define, with the columns 'group' and 'category'. This will be your lookup dataframe:

category_rules = [('A', 'small'), ('B', 'large')]  # and so on
lookup_df = spark.createDataFrame(category_rules, ['group', 'category'])

Then you can left join both dataframes on the column 'group', so that every row whose group value appears in lookup_df gets the matching category value.

df = df.join(lookup_df, ['group'], 'left')

Because this is a left join, any group value in your df (the left side) that is not included in lookup_df, like 'C', will have a null category.

3 answers

solution1
3 ACCEPTED 2020-10-15 16:25:27

solution2
2 2020-10-15 16:16:37

solution3
0 2020-10-15 15:59:13
