
Pyspark: Pass multiple columns along with an argument in UDF

I am writing a UDF that takes two DataFrame columns along with an extra parameter (a constant value) and adds a new column to the DataFrame. My function looks like this:

def udf_test(column1, column2, constant_var):
    if column1 == column2:
        return column1
    else:
        return constant_var

I am also doing the following to pass in multiple columns:

apply_test = udf(udf_test, StringType())
df = df.withColumn('new_column', apply_test('column1', 'column2'))

This does not work unless I remove constant_var as the function's third argument, but I really need it. So I have tried something like the following:

constant_var = 'TEST'
apply_test = udf(lambda x: udf_test(x, constant_var), StringType())
df = df.withColumn('new_column', apply_test(constant_var)(col('column1', 'column2')))

and

apply_test = udf(lambda x,y: udf_test(x, y, constant_var), StringType())

None of the above has worked for me. I got those ideas from this and this Stack Overflow post, and I think it is clear how my question differs from both of them. Any help would be much appreciated.

NOTE: I have simplified the function here for the sake of discussion; the actual function is more complex. I know this particular operation could be done using when and otherwise.

You do not have to use a user-defined function. You can use the functions when() and otherwise():

from pyspark.sql import functions as f
df = df.withColumn('new_column', 
                   f.when(f.col('col1') == f.col('col2'), f.col('col1'))
                    .otherwise('other_value'))

Another way to do it is to generate a user-defined function. However, using UDFs has a negative impact on performance, since the data must be serialized to and deserialized from Python. To generate a user-defined function, you need a function that returns a (user-defined) function, so that the constant is captured when the UDF is created. For example:

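from pyspark.sql.types import StringType

def generate_udf(constant_var):
    # test() closes over constant_var, so the resulting UDF takes only the two columns
    def test(col1, col2):
        if col1 == col2:
            return col1
        else:
            return constant_var
    return f.udf(test, StringType())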

df = df.withColumn('new_column', 
                   generate_udf('default_value')(f.col('col1'), f.col('col2')))
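
If you would rather keep the three-argument udf_test from the question unchanged, here is a minimal sketch of two other ways to get the constant in. The helper names add_with_closure and add_with_lit are made up for illustration, and the column names column1 and column2 are taken from the question. The first variant binds the constant in a lambda, so the UDF receives two columns. The second passes the constant as a literal column via lit(), so the UDF receives three.

```python
def udf_test(column1, column2, constant_var):
    # Logic from the question, unchanged; plain Python, so it can be
    # unit-tested without a Spark session.
    if column1 == column2:
        return column1
    else:
        return constant_var


def add_with_closure(df, constant_var):
    """Hypothetical helper: bind the constant in a lambda (UDF takes 2 columns)."""
    # Imported inside the function so udf_test stays testable without Spark.
    from pyspark.sql import functions as f
    from pyspark.sql.types import StringType

    apply_test = f.udf(lambda a, b: udf_test(a, b, constant_var), StringType())
    return df.withColumn(
        'new_column',
        apply_test(f.col('column1'), f.col('column2')),
    )


def add_with_lit(df, constant_var):
    """Hypothetical helper: pass the constant as a literal column (UDF takes 3 columns)."""
    from pyspark.sql import functions as f
    from pyspark.sql.types import StringType

    apply_test = f.udf(udf_test, StringType())
    return df.withColumn(
        'new_column',
        # f.lit() wraps the Python value in a Column, which is what a UDF call expects.
        apply_test(f.col('column1'), f.col('column2'), f.lit(constant_var)),
    )
```

Either helper would be called as df = add_with_closure(df, 'TEST') or df = add_with_lit(df, 'TEST'). The attempts in the question most likely failed because the lambda's arity did not match the number of columns passed, and because a bare Python string passed to a UDF call is interpreted as a column name rather than as a value.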


 