Spark UDF error when input parameter is a value concatenated from two columns of a dataframe
The following Python code loads a CSV file into dataframe df and sends a string value from one or more columns of df to the UDF function testFunction(...). The code works fine if I send a single column value. But if I send the value df.address + " " + df.city built from two columns of df, I get the following error:
Question: What might I be doing wrong, and how can we fix the issue? All the columns in df are NOT NULL, so a null or empty string should not be an issue. For example, if I send the single column value df.address, that value contains blank spaces (e.g. 123 Main Street). So why the error when the concatenated values of two columns are sent to the UDF?
Error:
PythonException: An exception was thrown from a UDF: 'AttributeError: 'NoneType' object has no attribute 'upper''
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="true")
def testFunction(value):
    # uppercase, turn punctuation into spaces, collapse double spaces, trim
    mystr = value.upper().replace(".", " ").replace(",", " ").replace("  ", " ").strip()
    return mystr
newFunction = F.udf(testFunction, StringType())
df2 = df.withColumn("myNewCol", newFunction(df.address + " " + df.city))
df2.show()
In PySpark you cannot concatenate StringType columns together using +. Spark's + is arithmetic addition, so it returns null for string operands, and that null is what breaks your udf: value.upper() is called on None (see the check below). You can use concat instead:
df2 = df.withColumn("myNewCol", newFunction(F.concat(df.address, F.lit(" "), df.city)))
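You can confirm the null without involving the udf by selecting the + expression directly. A minimal check, assuming the same df as in the question:

# Spark's + casts the string operands to a numeric type; the cast
# yields null, so the whole expression evaluates to null on every row
df.select((df.address + " " + df.city).alias("plus_result")).show()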
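If the data could ever contain nulls, concat_ws is a useful alternative: it takes the separator as its first argument and skips null inputs, whereas concat returns null if any argument is null. A sketch reusing newFunction from above:

# concat_ws(sep, *cols) ignores null columns, so a null address or
# city cannot propagate a None into the udf
df2 = df.withColumn("myNewCol", newFunction(F.concat_ws(" ", df.address, df.city)))
df2.show()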