Spark UDF error when input parameter is a value concatenated from two columns of a dataframe
The following Python code loads a CSV file into dataframe df and sends a string value from one or more columns of df to the UDF function testFunction(...). The code works fine if I send a single column value. But if I send the value df.address + " " + df.city built from two columns of df, I get the following error:
Question: What might I be doing wrong, and how can we fix the issue? All the columns in df are NOT NULL, so a null or empty string should not be an issue. For example, if I send the single column value df.address, that value contains blank spaces (e.g. 123 Main Street). So why the error when the concatenated values of two columns are sent to the UDF?
Error:
PythonException: An exception was thrown from a UDF: 'AttributeError: 'NoneType' object has no attribute 'upper''
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="true")
def testFunction(value):
    # uppercase, turn punctuation into spaces, collapse double spaces, trim
    mystr = value.upper().replace(".", " ").replace(",", " ").replace("  ", " ").strip()
    return mystr
newFunction = F.udf(testFunction, StringType())
df2 = df.withColumn("myNewCol", newFunction(df.address + " " + df.city))
df2.show()
In PySpark you cannot concatenate StringType columns together using +. Spark's + is arithmetic addition, so it returns null for string operands, and that null is what breaks your udf: value.upper() is called on None (see the check below). You can use concat instead:
df2 = df.withColumn("myNewCol", newFunction(F.concat(df.address, F.lit(" "), df.city)))
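You can confirm the null without involving the udf by selecting the + expression directly. A minimal check, assuming the same df as in the question:

# Spark's + casts the string operands to a numeric type; the cast
# yields null, so the whole expression evaluates to null on every row
df.select((df.address + " " + df.city).alias("plus_result")).show()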
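If the data could ever contain nulls, concat_ws is a useful alternative: it takes the separator as its first argument and skips null inputs, whereas concat returns null if any argument is null. A sketch reusing newFunction from above:

# concat_ws(sep, *cols) ignores null columns, so a null address or
# city cannot propagate a None into the udf
df2 = df.withColumn("myNewCol", newFunction(F.concat_ws(" ", df.address, df.city)))
df2.show()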