What is the proper way to define a Pandas UDF in a Palantir Foundry Code Repository?
I would like to define the following pandas_udf in a Palantir Foundry code repository.
@pandas_udf("long", PandasUDFType.GROUPED_AGG)
def percentile_95_udf(v):
    return v.quantile(0.95)
But when I try to define this udf in the global scope, I get the error:
AttributeError: 'NoneType' object has no attribute '_jvm'
However, if I define this same function within a function called by my transform, the code runs fine, as in:
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from transforms.api import transform, Input, Output

@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(percentile_95(data_in.dataframe()))

def percentile_95(df):
    @pandas_udf("long", PandasUDFType.GROUPED_AGG)
    def percentile_95_udf(v):
        return v.quantile(0.95)

    # group rows for each interface into 1 day periods
    grp_by = df.groupBy(df.objectId, F.window("TimeCaptured", "1 day"))
    stats = [
        percentile_95_udf(df.ReceivedWidgets),
        percentile_95_udf(df.TransmittedWidgets),
    ]
    result = grp_by.agg(*stats)
    cleaned = result.withColumn("Day", F.col("window").start).drop("window")
    return cleaned
Why does my pandas_udf not work in global scope but does work when defined within another function? Also, is there a better approach to defining a pandas_udf? Defining it as a nested function prevents me from reusing my udf.
For reference, my code repository in Palantir Foundry has the following structure:
transforms-python
    conda_recipe
        meta.yaml
    src
        myproject
            datasets
                __init__.py
                percentile_95.py
            __init__.py
            pipeline.py
        setup.cfg
        setup.py
This has a root cause similar to this question: PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'
When you make calls at global level, you are trying to execute Spark commands (via pandas_udf, in your case) before Spark is set up. When you make the call inside the transform, Spark is already available, so it works.
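The underlying Python mechanics: a decorator runs the moment the module is imported, not when the decorated function is later called. Here is a minimal, Spark-free sketch of the same failure mode (`needs_session` is a stand-in for `pandas_udf`, and `_session` for the not-yet-created SparkSession; neither is a real Spark or Foundry API):

```python
_session = None  # stand-in for the SparkSession; does not exist yet at import time

def needs_session(func):
    # Like pandas_udf, this decorator needs runtime state as soon as it runs.
    if _session is None:
        raise AttributeError("'NoneType' object has no attribute '_jvm'")
    return func

# Decorating at module (global) level fails, because _session is still None:
try:
    @needs_session
    def percentile_95_udf(v):
        return v
except AttributeError as err:
    decoration_error = err

def make_udf():
    # Deferring the decorator call until _session exists avoids the error.
    return needs_session(lambda v: v)

_session = object()  # later (e.g. inside a transform) the "session" exists
udf = make_udf()     # now decoration succeeds
```

The global `@pandas_udf(...)` in the question fails for the same reason: the decorator executes at import time, before Foundry has started Spark.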
The main problem is that the annotation itself runs at the top level of the module, and Spark is only set up when the transform runs. When you define the udf inside

def percentile_95(df):

the annotation is only evaluated once that function is called, and it is called from within a transform:
@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(
        percentile_95(  # <-- here we're inside a transform
            data_in.dataframe()))
If you want to reuse these UDFs in multiple places, you could wrap them in a function or a class that you initialise inside each transform where you want to use them.
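One way to sketch that is a small factory in a shared module (the module path `src/myproject/datasets/udfs.py` and the names `percentile_95_agg` and `make_percentile_95_udf` are illustrative, not Foundry APIs):

```python
def percentile_95_agg(v):
    # Plain pandas aggregation logic: takes a pandas Series, returns a scalar.
    # This part is testable without any SparkSession.
    return v.quantile(0.95)

def make_percentile_95_udf():
    # Import and decorate lazily: pandas_udf touches the Spark JVM, so this
    # function must only be called inside a transform, after Spark is set up.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    return pandas_udf("double", PandasUDFType.GROUPED_AGG)(percentile_95_agg)
```

Each transform then calls `make_percentile_95_udf()` once at the top of its body; by that point Spark is available, so the decorator succeeds, and the pandas logic itself stays importable and unit-testable anywhere.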