PySpark: passing multiple columns plus a parameter to a UDF

Question

I am writing a UDF that takes two DataFrame columns along with an extra argument (a constant value) and should add a new column to the DataFrame. My function looks like:

def udf_test(column1, column2, constant_var):
    if column1 == column2:
        return column1
    else:
        return constant_var

Also, I am doing the following to pass in multiple columns:

apply_test = udf(udf_test, StringType())
df = df.withColumn('new_column', apply_test('column1', 'column2'))

This does not work unless I remove constant_var as the function's third argument, but I really need it. So I have tried the following:

constant_var = 'TEST'
apply_test = udf(lambda x: udf_test(x, constant_var), StringType())
df = df.withColumn('new_column', apply_test(constant_var)(col('column1', 'column2')))

and

apply_test = udf(lambda x,y: udf_test(x, y, constant_var), StringType())

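Neither of the above worked for me. I got these ideas from this and this Stack Overflow post, and I think it is clear how my question differs from both. Any help would be much appreciated.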

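Note: I have simplified the function here for the sake of discussion; the actual function is more complex. I know this particular operation could be done with when and otherwise.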

Answer 1

You don't have to use a user-defined function here. You can use the when() and otherwise() functions:

from pyspark.sql import functions as f
df = df.withColumn('new_column', 
                   f.when(f.col('col1') == f.col('col2'), f.col('col1'))
                    .otherwise('other_value'))

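Another approach is to generate a user-defined function. Be aware that a UDF hurts performance, because the data has to be serialized to and deserialized from Python. To generate the UDF, you need a function that returns a (user-defined) function. For example: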

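from pyspark.sql.types import StringType

def generate_udf(constant_var):
    # test closes over constant_var, so the UDF itself only takes the two columns
    def test(col1, col2):
        if col1 == col2:
            return col1
        else:
            return constant_var
    return f.udf(test, StringType())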

df = df.withColumn('new_column', 
                   generate_udf('default_value')(f.col('col1'), f.col('col2')))
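
If you would rather keep the three-argument udf_test from the question unchanged, here are two other ways that should also work. This is only a sketch: the helper names with_new_column_lit and with_new_column_lambda are made up for illustration, and the column names column1 and column2 are taken from the question. The first passes the constant as a literal column with f.lit, the second captures it in a two-argument lambda (the question's second attempt, applied to both columns).

```python
def udf_test(column1, column2, constant_var):
    # Same logic as in the question: keep the value when both columns agree.
    if column1 == column2:
        return column1
    return constant_var


def with_new_column_lit(df, constant_var):
    # Hypothetical helper. pyspark is imported here so that udf_test above
    # can be checked on its own without a Spark installation.
    from pyspark.sql import functions as f
    from pyspark.sql.types import StringType

    apply_test = f.udf(udf_test, StringType())
    # f.lit turns the Python constant into a Column, so the UDF receives
    # it as an ordinary third argument on every row.
    return df.withColumn(
        'new_column',
        apply_test(f.col('column1'), f.col('column2'), f.lit(constant_var)))


def with_new_column_lambda(df, constant_var):
    # Hypothetical helper: the lambda captures constant_var, so the UDF
    # takes exactly two columns and must be called with exactly two.
    from pyspark.sql import functions as f
    from pyspark.sql.types import StringType

    apply_test = f.udf(lambda x, y: udf_test(x, y, constant_var), StringType())
    return df.withColumn(
        'new_column',
        apply_test(f.col('column1'), f.col('column2')))
```

Either helper would be called as df = with_new_column_lit(df, 'TEST'). Both still pay the UDF serialization cost mentioned above, so when/otherwise remains preferable whenever the logic can be expressed that way.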

Answer 1 · 6 · accepted · 2018-10-16 20:41:16