
Databricks spark UDF not working on filtered dataframe

I came across an issue in Databricks with PySpark, and I'm trying to understand why this implementation isn't working and whether I'm missing something conceptual here. What I'm trying to do is run a UDF on a column of a dataframe, but only on the non-null values.

If I replace the lstrip_udf call with a fixed value like "Val123", it works fine, but it doesn't work with the UDF. It also works if I implement a null check inside the UDF with a slightly different implementation. But even with the when and the isNotNull guard, it still throws the error below.

Can someone explain why, or what I'm missing here to make this work?

Code:

from pyspark.sql.types import StructType, StructField, StringType
inputschema = StructType([StructField("testcol", StringType(), True),
                          StructField("testcol2", StringType(), True)
                         ]
                        )
inputfile = spark.createDataFrame([("012121212","Ref #1"),
                                   ("0034343434","Ref #2"),
                                   ("0034343434","Ref #3"),
                                   (None,"Ref #4"),
                                   (None,"Ref #5"),
                                   ("00998877","Ref #6")
                                  ],
                                  schema = inputschema
                                 )
#display(inputfile)

from pyspark.sql.functions import col, when, lit, udf
column_name = "testcol"
# strip leading whitespace, then leading zeros
lstrip_udf = udf(lambda s: s.lstrip().lstrip("0"), StringType())
outputfile = (inputfile.withColumn(column_name,
                                  when(col(column_name).isNotNull(),
                                       lstrip_udf(col(column_name)) #replace this line with "Val123" and it works
                                      )
                                 ))
display(outputfile)

Error:

File "<command-3701821159856508>", line 18, in <lambda>
AttributeError: 'NoneType' object has no attribute 'lstrip'

Thanks

This is not really a bug: Spark does not guarantee the order in which subexpressions are evaluated, so the optimizer may invoke the UDF before (or regardless of) the when/isNotNull guard, and the lambda can still receive None. The reliable fix is a minor modification that handles null inside the UDF itself:

lstrip_udf = udf(lambda s: s.lstrip().lstrip("0") if s is not None else None, StringType())
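
With the null check inside the UDF, the when/isNotNull guard is no longer needed; null rows simply pass through. A quick check against the inputfile built in the question, with the expected output sketched in the comments:

# Null-safe UDF: the when/isNotNull guard is no longer required.
outputfile = inputfile.withColumn(column_name, lstrip_udf(col(column_name)))
outputfile.show()
# +--------+--------+
# | testcol|testcol2|
# +--------+--------+
# |12121212|  Ref #1|
# |34343434|  Ref #2|
# |34343434|  Ref #3|
# |    null|  Ref #4|
# |    null|  Ref #5|
# |  998877|  Ref #6|
# +--------+--------+
# (nulls display as NULL on Spark 3.4+)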

Or you can do this with Spark's built-in SQL functions, which is more efficient than a Python UDF (no per-row serialization between the JVM and Python):

from pyspark.sql import functions as F

outputfile = (
    inputfile.withColumn(column_name,
        F.when(F.col(column_name).isNotNull(),
            # ltrim(testcol) strips leading spaces; ltrim('0', ...) then strips leading zeros
            F.expr("ltrim('0', ltrim(testcol))")
        )
    )
)
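
If you prefer to stay in the DataFrame API (the two-argument ltrim form is deprecated on newer Spark versions), here is a minimal sketch using the built-in column functions ltrim and regexp_replace; this is my own variation, not the original answer's approach. Built-in functions return null for null input, so no explicit guard is needed:

from pyspark.sql import functions as F

# Sketch: strip leading spaces with ltrim, then leading zeros with a regex.
# Built-ins propagate null on their own, so no when/isNotNull is needed.
# Note: ltrim removes leading spaces only, while Python's lstrip() also
# removes tabs and newlines.
outputfile = inputfile.withColumn(
    column_name,
    F.regexp_replace(F.ltrim(F.col(column_name)), "^0+", "")
)
display(outputfile)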
