
Spark UDF error when input parameter is a value concatenated from two columns of a dataframe

The following Python code loads a CSV file into the dataframe df and sends a string value from a single column, or from multiple columns, of df to the UDF testFunction(...). The code works fine if I send a single column value. But if I send the value df.address + " " + df.city built from two columns of df, I get the following error:

Question: What may I be doing wrong, and how can we fix the issue? All the columns in df are NOT NULL, so a null or empty string should not be an issue. For example, if I send the single column value df.address, that value contains blank spaces (e.g. 123 Main Street) and works fine. So why the error when the concatenated values of two columns are sent to the UDF?

Error:

PythonException: An exception was thrown from a UDF: 'AttributeError: 'NoneType' object has no attribute 'upper''

from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="true")

def testFunction(value):
  mystr = value.upper().replace(".", " ").replace(",", " ").replace("  ", " ").strip()
  return mystr

newFunction = F.udf(testFunction, StringType())

df2 = df.withColumn("myNewCol", newFunction(df.address + " " + df.city))
df2.show()

In PySpark you cannot concatenate StringType columns with +. Spark SQL treats + as numeric addition, so it tries to cast the strings to a numeric type; the cast fails and the expression evaluates to null, which is then passed into your UDF and breaks value.upper(). Use concat instead:

df2 = df.withColumn("myNewCol", newFunction(F.concat(df.address, F.lit(" "), df.city)))

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you reproduce them, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 