
Correct Way to Specify User-Defined Function in PySpark Pandas UDF

I am using pyspark 2.4.2, so per the docs for this version one can do this to create a GROUPED_MAP:

from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],("id", "v"))

@pandas_udf(returnType="id long, v double", functionType=PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()

This works, but you cannot call subtract_mean as a normal Python function that is passed a pandas DataFrame. However, if you do this instead, it works:

def subtract_mean(pdf):
    v = pdf.v
    return pdf.assign(v=v - v.mean())

sub_spark = pandas_udf(f=subtract_mean, returnType="id long, v double", functionType=PandasUDFType.GROUPED_MAP)

df.groupby("id").apply(sub_spark).show()

Now you can call subtract_mean from Python, passing it a pandas DataFrame. How does one do this using the annotation approach? It is not clear from the docs how to do this. What function is annotated and what function is given for the f parameter?

The two ways of specifying a UDF are equivalent. The decorator approach is just a neater way of doing things. The function that follows the decorator is passed as the f parameter.
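The equivalence can be seen without Spark at all. The sketch below is only an illustration: register is a hypothetical stand-in (not a PySpark function) that accepts f and returnType the way pandas_udf does, and plain lists stand in for a pandas DataFrame. It shows that the decorator form and the explicit call both hand the function to the f parameter.

```python
import functools


def register(f=None, returnType=None):
    """Hypothetical stand-in for pandas_udf; NOT part of PySpark."""
    if f is None:
        # Used as @register(returnType=...): return a decorator that
        # will receive the function defined below it as `f`.
        return functools.partial(register, returnType=returnType)
    return {"f": f, "returnType": returnType}


def subtract_mean(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Explicit form: the function is passed as f, and the plain name
# subtract_mean still refers to the ordinary Python function.
explicit = register(f=subtract_mean, returnType="id long, v double")


# Decorator form: Python evaluates this as
#   decorated = register(returnType=...)(decorated)
# so the name `decorated` now refers to the wrapped result.
@register(returnType="id long, v double")
def decorated(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]
```

The only practical difference is which name ends up bound to the wrapped object: with the decorator, the original function's name is rebound, which is why it can no longer be called directly.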

As described in this answer, you can use subtract_mean.__wrapped__ to get back the original undecorated function. The second approach in your question is more readable, though. Using __wrapped__ makes the code less readable. But if it's just for unit testing purposes, it should be fine.
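For context on where __wrapped__ comes from: in Python 3, functools.wraps stores the original function on the wrapper under that attribute. The sketch below assumes the UDF wrapper is built with functools.wraps, as the answer implies; spark_like_udf is a hypothetical stand-in (not a PySpark function), and a plain list stands in for a pandas DataFrame.

```python
import functools


def spark_like_udf(f):
    """Hypothetical stand-in for a UDF wrapper; NOT part of PySpark."""

    @functools.wraps(f)  # copies metadata and sets wrapper.__wrapped__ = f
    def wrapper(*cols):
        # A real UDF wrapper expects Spark columns, not local data.
        raise TypeError("call through Spark, not directly")

    return wrapper


@spark_like_udf
def subtract_mean(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# In a unit test, reach past the wrapper to the undecorated function.
original = subtract_mean.__wrapped__
result = original([1.0, 2.0, 3.0])
```

With a real pandas_udf, the same pattern would be subtract_mean.__wrapped__(some_pandas_df) inside the test.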


 