
How to dynamically chain when conditions in PySpark?

Context

A dataframe should have a category column that is derived from a set of fixed rules. The set of rules is becoming quite large.

Question

Is there a way to use a list of tuples (see the example below) to dynamically chain the when conditions and achieve the same result as the hard-coded solution at the bottom?

# Potential list of rule definitions
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
    # (group, size is smaller than this value) --> category
    # and so on ...
]

Example

Here is a toy example for reproducibility. A dataframe consisting of groups and ids should get an additional category column, which depends on the content of the group and size columns. The list of rules is shown in the section above.

Input data
+-----+-----+----+
|group|   id|size|
+-----+-----+----+
|    A|45345|   5|
|    C|55345|   5|
|    A|35345|  10|
|    B|65345|   4|
+-----+-----+----+

Hard coded solution
from pyspark.sql import functions as F

df = df.withColumn(
    'category',
    F.when(
        (F.col('group') == 'A')
        & (F.col('size') < 8),
        F.lit('small')
    ).when(
        (F.col('group') == 'A')
        & (F.col('size') < 30),
        F.lit('large')
    ).when(
        (F.col('group') == 'B')
        & (F.col('size') < 5),
        F.lit('small')
    ).otherwise(
        F.lit('unknown')
    )
)
+-----+-----+----+--------+
|group|   id|size|category|
+-----+-----+----+--------+
|    A|45345|   5|   small|
|    C|55345|   5| unknown|
|    A|35345|  10|   large|
|    B|65345|   4|   small|
+-----+-----+----+--------+

[Edit 1] Added more complex conditions to explain why chaining is needed.

A solution based on the DataFrame API:

from pyspark.sql import functions as F

# Assumes two-element rules of the form (group, category),
# e.g. category_rules = [('A', 'small'), ('B', 'large')].
cond = F.when(F.col('group') == category_rules[0][0], F.lit(category_rules[0][1]))
for c in category_rules[1:]:
    cond = cond.when(F.col('group') == c[0], F.lit(c[1]))
cond = cond.otherwise('unknown')

df.withColumn("category", cond).show()

You can use string interpolation to build a SQL expression such as:

CASE 
   WHEN (group = 'A') THEN 'small' 
   WHEN (group = 'B') THEN 'large'
   ELSE 'unknown'
END

Then use it in a Spark expression:

from pyspark.sql.functions import expr

# Assumes an active SparkSession named spark.
data = [('A', '45345'), ('C', '55345'), ('A', '35345'), ('B', '65345')]
df = spark.createDataFrame(data, ['group', 'id'])

category_rules = [('A', 'small'), ('B', 'large')]

when_cases = [f"WHEN (group = '{r[0]}') THEN '{r[1]}'" for r in category_rules]

rules_expr = "CASE " + " ".join(when_cases) + " ELSE 'unknown' END"
# CASE WHEN (group = 'A') THEN 'small' WHEN (group = 'B') THEN 'large' ELSE 'unknown' END

df.withColumn('category', expr(rules_expr)).show()

# +-----+-----+--------+
# |group|   id|category|
# +-----+-----+--------+
# |    A|45345|   small|
# |    C|55345| unknown|
# |    A|35345|   small|
# |    B|65345|   large|
# +-----+-----+--------+
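
The string-building approach can be extended to the size-based rules from the question's edit. The following is a sketch under the assumption that each rule is a (group, exclusive upper bound on size, category) tuple. build_case_expr is a hypothetical helper name, and the column names are wrapped in backticks as a precaution because group is a SQL keyword. Values are interpolated without escaping, so this is only suitable for a trusted rule list.

```python
# Rules as in the question: (group, exclusive upper bound on size, category).
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
]


def build_case_expr(rules, default='unknown'):
    """Hypothetical helper: render the rules as one SQL CASE expression."""
    if not rules:
        # A CASE without any WHEN branch is not valid SQL, so fall back
        # to a plain string literal.
        return f"'{default}'"
    when_cases = [
        f"WHEN (`group` = '{group}' AND `size` < {limit}) THEN '{category}'"
        for group, limit, category in rules
    ]
    return "CASE " + " ".join(when_cases) + f" ELSE '{default}' END"


rules_expr = build_case_expr(category_rules)
print(rules_expr)
```

The resulting string can be passed to expr in the same way as in the answer above, for example df.withColumn('category', expr(rules_expr)).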

I hope this solution fits you:

Create a new dataframe from the list of tuples you define, with the columns 'group' and 'category'. This will be your lookup dataframe:

category_rules = [('A', 'small'), ('B', 'large')]  # and so on

lookup_df = spark.createDataFrame(category_rules , ['group', 'category'])

Then you can left join both dataframes on the column 'group', so every row whose group value appears in lookup_df gets the matching category value from it.

df = df.join(lookup_df, ['group'], 'left')

Because this is a left join, a group value in your df (the left side) that is not included in lookup_df, such as 'C', will get a null category.
