
Pyspark: How to deal with null values in python user defined functions

I want to use some string similarity functions that are not native to pyspark, such as the Jaro and Jaro-Winkler measures, on dataframes. These are readily available in python modules such as jellyfish. I can write pyspark udf's fine for cases where no null values are present, i.e. comparing cat to dog. When I apply these udf's to data where null values are present, it doesn't work. In problems such as the one I'm solving, it is very common for one of the strings to be null.

I need help getting my string similarity udf to work in general, and more specifically, to work in cases where one of the values is null.

I wrote a udf that works when there are no null values in the input data:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish

def jaro_winkler_func(df, column_left, column_right):

    # Wrap jellyfish's Jaro-Winkler measure as a Spark UDF returning a double
    jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())

    df = (df
          .withColumn('test',
                      jaro_winkler_udf(df[column_left], df[column_right])))

    return df

Example input and output:

+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
+-----------+------------+
+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
+-----------+------------+------------------+

When I run this on data that has null values, I get the usual reams of Spark errors; the most applicable one seems to be TypeError: str argument expected. I assume this one is due to the null values in the data, since the udf worked when there were none.
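The failure mode can be reproduced outside Spark: the underlying similarity function simply cannot accept None. A minimal sketch, using difflib.SequenceMatcher from the standard library as a stand-in for jellyfish.jaro_winkler (the stand-in and its exact error message are assumptions of this sketch):

```python
from difflib import SequenceMatcher

def similarity(s1, s2):
    # Stand-in for jellyfish.jaro_winkler: like it, this
    # expects two strings and raises TypeError on None
    return SequenceMatcher(None, s1, s2).ratio()

print(similarity("dude", "dud"))   # fine on plain strings

try:
    similarity("spud", None)       # a null value reaches the function
except TypeError as e:
    print("TypeError:", e)
```

This is exactly what happens inside the udf: Spark hands the Python function a None for each null cell, and the function blows up.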

I modified the function above to check that both values are not null, run the similarity function only in that case, and otherwise return 0.0:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish

def jaro_winkler_func(df, column_left, column_right):

    jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())

    # Guard with F.when so the udf is only applied to non-null pairs
    df = (df
          .withColumn('test',
                      F.when(df[column_left].isNotNull() & df[column_right].isNotNull(),
                             jaro_winkler_udf(df[column_left], df[column_right]))
                      .otherwise(0.0)))

    return df

However, I still get the same errors as before.

Sample input and what I would like the output to be:

+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
|       spud|        null|
|       null|        null|
+-----------+------------+
+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
|       spud|        null|               0.0|
|       null|        null|               0.0|
+-----------+------------+------------------+

The when/otherwise guard does not help because Spark SQL does not guarantee short-circuit evaluation of conditional expressions, so the udf can still be invoked on rows containing nulls. The null check has to live inside the udf itself. We will modify your code a little bit and it should work fine:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import jellyfish

@udf(DoubleType())
def jaro_winkler(s1, s2):
    # Handle nulls inside the udf; note this also treats empty strings as null
    if not all((s1, s2)):
        out = 0.0  # return a float so it matches the declared DoubleType
    else:
        out = jellyfish.jaro_winkler(s1, s2)
    return out


def jaro_winkler_func(df, column_left, column_right):

    df = df.withColumn(
        'test',
        jaro_winkler(df[column_left], df[column_right])
    )

    return df
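Outside Spark, the guard inside the udf can be sketched in plain Python. As before, difflib.SequenceMatcher stands in for jellyfish.jaro_winkler, which is an assumption of this sketch:

```python
from difflib import SequenceMatcher

def null_safe_similarity(s1, s2):
    # Same guard as the udf: any null (or empty) input yields 0.0,
    # otherwise delegate to the real similarity measure
    if not all((s1, s2)):
        return 0.0
    return SequenceMatcher(None, s1, s2).ratio()

print(null_safe_similarity("spud", None))   # 0.0
print(null_safe_similarity(None, None))     # 0.0
print(null_safe_similarity("dude", "dud"))  # a float in (0, 1]
```

Because the check happens inside the function, it is safe no matter how or when Spark decides to evaluate the udf.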
