
How to build F.when based on a variable number of conditions in pyspark

I'm trying to build a series of F.when calls based on a variable number of conditions. How can I build the logic below using a loop where I supply a list of items to test (i.e. [1,2,3], following the example below)?

I ask because I want to be able to build these conditions from a variable number of test items in the list. The loop logic should build something like the below, but by passing in a list of numbers to test, [1,2,3].

F.when(F.col("test") == 1, "out_" + str(1)) \
    .when(F.col("test") == 2, "out_" + str(2)) \
    .when(F.col("test") == 3, "out_" + str(3)) \
    .otherwise(-1)

I've tried to use reduce to do this, but haven't figured it out. Does anyone have any advice?

reduce(lambda x, i: x.when(F.col("test") == i, "out_" + str(i)),
       output_df,
       F).otherwise(-1)

My expected output should provide the same logic as the following:

Column<b'CASE WHEN (test = 1) THEN out_1 WHEN (test = 2) THEN out_2 WHEN (test = 3) THEN out_3 ELSE -1 END'>

You almost got it; you need to pass the list of test cases as the second argument to the reduce function:

from functools import reduce
import pyspark.sql.functions as F


tests = [1, 2, 3]

new_col = reduce(
    lambda x, i: x.when(F.col("test") == i, "out_" + str(i)),
    tests,
    F
).otherwise(-1)

print(new_col)

# Column<'CASE WHEN (test = 1) THEN out_1 WHEN (test = 2) THEN out_2 WHEN (test = 3) THEN out_3 ELSE -1 END'>
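To sanity-check, here is a minimal usage sketch that applies the built expression with withColumn (the sample DataFrame and its test column are assumptions based on your example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with a "test" column, as in the question
df = spark.createDataFrame([(1,), (2,), (4,)], ["test"])
df.withColumn("result", new_col).show()
# +----+------+
# |test|result|
# +----+------+
# |   1| out_1|
# |   2| out_2|
# |   4|    -1|
# +----+------+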

Since your check value is the same as your output value, just with out_ prepended, you could instead check whether the value is in the predefined list and, if it is, simply prepend out_.

Example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F


data = [
    {"test": 1},
    {"test": 2},
    {"test": 3},
    {"test": 4},
    {"test": 5},
]

test_ints = [1, 2, 3]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df = df.withColumn(
    "result",
    F.when(
        F.col("test").isin(test_ints),           # value is in the predefined list
        F.concat(F.lit("out_"), F.col("test")),  # prepend "out_" to it
    ).otherwise(-1),
)
df.show(truncate=False)

Result:

+----+------+                                                                   
|test|result|
+----+------+
|1   |out_1 |
|2   |out_2 |
|3   |out_3 |
|4   |-1    |
|5   |-1    |
+----+------+
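Note that the isin / concat approach only works because the output can be derived directly from the tested value; the reduce approach is more general, since it lets each condition map to an arbitrary output.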
