Pandas UDF with dictionary lookup and conditionals

I want to use pandas_udf in PySpark for certain column transformations and calculations, and it seems that a pandas UDF can't be written exactly like a normal UDF.

An example function looks something like this:

def modify_some_column(example_column_1, example_column_2):
    lookup_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # can be anything

    if example_column_1 in lookup_dict:
        if example_column_1 == 'a' and example_column_2 == "something":
            return lookup_dict[example_column_1]
        elif example_column_1 == 'a' and example_column_2 == "something else":
            return "something else"
        else:
            return lookup_dict[example_column_1]
    else:
        return ""

Basically, it takes two column values from a Spark DataFrame and returns a value, which I intend to use with withColumn :

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

modify_some_column_udf = pandas_udf(modify_some_column, returnType=StringType())
df = df.withColumn('new_col', modify_some_column_udf(df.col_1, df.col_2))

But this does not work. How should I modify the above so it can be used as a pandas UDF?

Edit: It is clear to me that the above conditions can be implemented easily and efficiently with native PySpark functions, but I am looking to write the above logic using a Pandas UDF.

With this simple if/else logic, you don't have to use a UDF. In fact, you should avoid using UDFs as much as possible.

Assuming you have a dataframe as follows:

df = spark.createDataFrame([
    ('a', 'something'),
    ('a', 'something else'),
    ('c', None),
    ('c', ''),
    ('c', 'something'),
    ('c', 'something else'),
    ('c', 'blah'),
    ('f', 'blah'),
], ['c1', 'c2'])
df.show()

+---+--------------+
| c1|            c2|
+---+--------------+
|  a|     something|
|  a|something else|
|  c|          null|
|  c|              |
|  c|     something|
|  c|something else|
|  c|          blah|
|  f|          blah|
+---+--------------+

You can create a temporary lookup column and use it to check against the other columns:

import json
your_lookup_dict = {'a' : 1, 'b' : 2, 'c' : 3,'d': 4, 'e' : 5}

import pyspark.sql.functions as F

(df
    .withColumn('lookup', F.from_json(F.lit(json.dumps(your_lookup_dict)), 'map<string, string>'))
    .withColumn('mod', F
        .when((F.col('c1') == 'a') & (F.col('c2') == 'something'), F.col('lookup')[F.col('c1')])
        .when((F.col('c1') == 'a') & (F.col('c2') == 'something else'), F.lit('something else'))
        .otherwise(F.col('lookup')[F.col('c1')])
    )
    .show(10, False)
)

+---+--------------+----------------------------------------+--------------+
|c1 |c2            |lookup                                  |mod           |
+---+--------------+----------------------------------------+--------------+
|a  |something     |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|1             |
|a  |something else|{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|something else|
|c  |null          |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3             |
|c  |              |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3             |
|c  |something     |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3             |
|c  |something else|{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3             |
|c  |blah          |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3             |
|f  |blah          |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|null          |
+---+--------------+----------------------------------------+--------------+
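
As a side note, if you prefer not to round-trip through JSON, the literal lookup can also be built with F.create_map. This is a minimal sketch assuming the same your_lookup_dict and df as above; since the first branch and the otherwise branch return the same lookup value, they collapse into a single otherwise, and the explicit cast keeps both branches of the when the same type:

from itertools import chain

# build a literal map column: create_map(lit('a'), lit(1), lit('b'), lit(2), ...)
lookup_map = F.create_map(*[F.lit(x) for x in chain(*your_lookup_dict.items())])

(df
    .withColumn('mod', F
        .when((F.col('c1') == 'a') & (F.col('c2') == 'something else'), F.lit('something else'))
        .otherwise(lookup_map[F.col('c1')].cast('string'))  # map values are ints; cast to string
    )
    .show(10, False)
)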

EDIT

Since you insist on using a Pandas UDF, you have to understand that Spark executes the pandas function on batches of your dataframe, so you'll have to wrap your function like this:

def wrapper(iterator):
    def modify_some_column(example_column_1, example_column_2):
        lookup_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # can be anything
        if example_column_1 in lookup_dict:
            if example_column_1 == 'a' and example_column_2 == "something":
                return str(lookup_dict[example_column_1])
            elif example_column_1 == 'a' and example_column_2 == "something else":
                return "something else"
            else:
                return str(lookup_dict[example_column_1])
        else:
            return ""

    # apply the row-wise function to each pandas DataFrame batch
    for pdf in iterator:
        pdf['mod'] = pdf.apply(lambda r: modify_some_column(r['c1'], r['c2']), axis=1)
        yield pdf

# add a placeholder 'mod' column so df.schema already includes the output column
df = df.withColumn('mod', F.lit('temp'))
df.mapInPandas(wrapper, df.schema).show()

+---+--------------+--------------+
| c1|            c2|           mod|
+---+--------------+--------------+
|  a|     something|             1|
|  a|something else|something else|
|  c|          null|             3|
|  c|              |             3|
|  c|     something|             3|
|  c|something else|             3|
|  c|          blah|             3|
|  f|          blah|              |
+---+--------------+--------------+
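
On Spark 3+, the same logic can also be written as a scalar (Series-to-Series) pandas_udf, which is closer to what the question originally attempted. This is a minimal sketch (modify_some_column_udf is an illustrative name), and it should produce the same mod column as the mapInPandas version:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def modify_some_column_udf(c1: pd.Series, c2: pd.Series) -> pd.Series:
    lookup_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

    def modify(v1, v2):
        if v1 in lookup_dict:
            if v1 == 'a' and v2 == 'something else':
                return 'something else'
            return str(lookup_dict[v1])
        return ''

    # the UDF receives whole batches as pandas Series,
    # so map the scalar helper over the paired values
    return pd.Series([modify(v1, v2) for v1, v2 in zip(c1, c2)])

df.withColumn('mod', modify_some_column_udf(df.c1, df.c2)).show()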
