A dataframe should get a category
column that is derived from a fixed set of rules, and that set of rules is becoming quite large.
Is there a way to use a list of tuples (see the example below) to dynamically chain the when
conditions and achieve the same result as the hard-coded solution at the bottom?
# Potential list of rule definitions
# Each tuple: (group, size must be smaller than this value, category)
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
    # and so on ...
]
Here is a toy example for reproducibility. A dataframe consisting of groups, ids and sizes should have the column category
added, which depends on the content of the group
and size columns. The list of rules is shown in the section above; the hard-coded solution and its output follow.
from pyspark.sql import functions as F

df = df.withColumn(
    'category',
    F.when(
        (F.col('group') == 'A')
        & (F.col('size') < 8),
        F.lit('small')
    ).when(
        (F.col('group') == 'A')
        & (F.col('size') < 30),
        F.lit('large')
    ).when(
        (F.col('group') == 'B')
        & (F.col('size') < 5),
        F.lit('small')
    ).otherwise(
        F.lit('unknown')
    )
)
+-----+-----+----+--------+
|group|   id|size|category|
+-----+-----+----+--------+
|    A|45345|   5|   small|
|    C|55345|   5| unknown|
|    A|35345|  10|   large|
|    B|65345|   4|   small|
+-----+-----+----+--------+
[Edit 1] Added more complex conditions to explain why chaining is needed.
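To make the chaining requirement concrete, here is a minimal plain-Python sketch of how a chained when over the rule list is evaluated: the first matching rule wins, so ('A', 8, 'small') has to come before ('A', 30, 'large'). The helper name categorize is hypothetical and only mirrors the semantics; it assumes size is never None.

```python
# Plain-Python reference for first-match-wins evaluation of the rule list.
# Assumption: size is never None (this sketch does not model nulls).
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
]


def categorize(group, size, rules, default='unknown'):
    """Return the category of the first rule that matches, else the default."""
    for rule_group, upper_bound, category in rules:
        if group == rule_group and size < upper_bound:
            return category
    return default


# Same row, different rule order: the result changes, so order matters.
print(categorize('A', 5, category_rules))
print(categorize('A', 5, list(reversed(category_rules))))
```

With the rules reversed, the ('A', 30, 'large') rule is reached first and shadows the narrower one, which is why the list order has to be preserved when the conditions are chained.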
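A solution based on the DataFrame API, assuming rules of the form (group, category) such as category_rules = [('A', 'small'), ('B', 'large')]:
from pyspark.sql import functions as F

# Start the chain with the first rule, then append one when() per remaining rule.
cond = F.when(F.col('group') == category_rules[0][0], F.lit(category_rules[0][1]))
for c in category_rules[1:]:
    cond = cond.when(F.col('group') == c[0], F.lit(c[1]))
# Rows that match no rule fall through to the default.
cond = cond.otherwise('unknown')
df.withColumn("category", cond).show()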
You can use string interpolation to create an expression such as:
CASE
    WHEN (group = 'A') THEN 'small'
    WHEN (group = 'B') THEN 'large'
    ELSE 'unknown'
END
And then use it in a Spark SQL expression:
from pyspark.sql.functions import expr
data = [('A', '45345'), ('C', '55345'), ('A', '35345'), ('B', '65345')]
df = spark.createDataFrame(data, ['group', 'id'])
category_rules = [('A', 'small'), ('B', 'large')]
# One WHEN clause per (group, category) rule
when_cases = [f"WHEN (group = '{r[0]}') THEN '{r[1]}'" for r in category_rules]
rules_expr = "CASE " + " ".join(when_cases) + " ELSE 'unknown' END"
# CASE WHEN (group = 'A') THEN 'small' WHEN (group = 'B') THEN 'large' ELSE 'unknown' END
df.withColumn('category', expr(rules_expr)).show()
# +-----+-----+--------+
# |group|   id|category|
# +-----+-----+--------+
# |    A|45345|   small|
# |    C|55345| unknown|
# |    A|35345|   small|
# |    B|65345|   large|
# +-----+-----+--------+
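For the three-element rules from the question, the string can be built the same way with the size threshold added to each WHEN clause. This is a minimal sketch: build_case_expr is a hypothetical helper name, and the values are interpolated directly into the SQL text, so it assumes the rules come from trusted code rather than user input.

```python
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
]


def build_case_expr(rules, default='unknown'):
    """Build a SQL CASE string from (group, upper_bound, category) rules."""
    when_cases = [
        f"WHEN (group = '{group}' AND size < {upper_bound}) THEN '{category}'"
        for group, upper_bound, category in rules
    ]
    return "CASE " + " ".join(when_cases) + f" ELSE '{default}' END"


rules_expr = build_case_expr(category_rules)
print(rules_expr)
```

The resulting string can then be passed to expr in the same way as in the answer above, for example df.withColumn('category', expr(rules_expr)), provided df has group and size columns.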
I hope this solution works for you:
Create a new dataframe from the list of tuples you define, with the columns 'group' and 'category'. This will be your lookup dataframe:
category_rules = [('A', 'small'), ('B', 'large')]  # and so on
lookup_df = spark.createDataFrame(category_rules, ['group', 'category'])
Then you can left join both dataframes on the column 'group', so every row whose group value appears in lookup_df gets the matching category value.
df = df.join(lookup_df, ['group'], 'left')
Because this is a left join, a group value in your df (the left side) that is not included in lookup_df, like 'C', gets a null category.