Passing a list as an argument to a udf method

Question

I am using the text preprocessing library https://github.com/berknology/text-preprocessing

I want to pass preprocess_functions as an argument to the preprocess_text method.

I am using the following example:

def preprocess_text_spark(df: SparkDataFrame,
                          target_column: str,
                          preprocessed_column_name: str = 'preprocessed_text'
                          ) -> SparkDataFrame:
    """ Preprocess text in a column of a PySpark DataFrame by leveraging PySpark UDF to preprocess text in parallel """
    preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation,
                            remove_special_character, normalize_unicode, remove_number,
                            remove_whitespace, remove_stopword, lemmatize_word,
                            stem_word, check_spelling]
    _preprocess_text = udf(preprocess_text, StringType())
    new_df = df.withColumn(preprocessed_column_name,
                           _preprocess_text(df[target_column], preprocess_functions))
    return new_df

This is the error I get:

TypeError: Invalid argument, not a string or column: [<function to_lower at 0x7f33f9a865f0>, <function remove_email at 0x7f33f9a93c20>, <function remove_url at 0x7f33f9a933b0>, <function remove_punctuation at 0x7f33f9a934d0>, <function remove_special_character at 0x7f33f9a935f0>, <function normalize_unicode at 0x7f33f9a93a70>, <function remove_number at 0x7f33f9a93170>, <function remove_whitespace at 0x7f33f9a93830>, <function remove_stopword at 0x7f33f9a93b00>, <function lemmatize_word at 0x7f33f9a8d4d0>, <function stem_word at 0x7f33f9a8d3b0>, <function check_spelling at 0x7f33f9a8d170>] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I tried wrapping preprocess_functions with array and lit, but that did not work either.

How can I solve this?

Answer 1

A Spark UDF cannot take functions as input. When you call it, it accepts only columns, or strings that name columns. Take a look at the example here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.udf.html?highlight=udf#pyspark.sql.functions.udf
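One common way around this is to bind the list of functions inside a closure before wrapping it with udf, so that only the column is passed when the UDF is called. The following is a minimal sketch of that idea, not the library's own method: make_preprocessor, lower_case and collapse_whitespace are hypothetical names invented for illustration (the last two stand in for steps such as to_lower and remove_whitespace), and the extra functions parameter on preprocess_text_spark is an assumption.

```python
from typing import Callable, List, Optional


# Hypothetical stand-ins for two preprocessing steps, so the sketch runs
# without the text_preprocessing package installed.
def lower_case(text: str) -> str:
    return text.lower()


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


def make_preprocessor(
    functions: List[Callable[[str], str]]
) -> Callable[[Optional[str]], Optional[str]]:
    """Return a one-argument function that applies `functions` in order."""
    def _apply(text):
        if text is None:  # Spark passes NULL cells to a Python UDF as None
            return None
        for fn in functions:
            text = fn(text)
        return text
    return _apply


def preprocess_text_spark(df, target_column, functions,
                          preprocessed_column_name="preprocessed_text"):
    # pyspark is imported lazily so the helpers above work without it
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # The function list is captured by the closure, not passed as a column
    _preprocess_text = udf(make_preprocessor(functions), StringType())
    # Only the column goes through the UDF call
    return df.withColumn(preprocessed_column_name,
                         _preprocess_text(df[target_column]))
```

With the real library, the same pattern would presumably be udf(lambda text: preprocess_text(text, preprocess_functions), StringType()), assuming preprocess_text accepts the function list as its second positional argument as in the question. The library also has to be installed on the executors, since the captured functions run there.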
