
What is the proper way to define a Pandas UDF in a Palantir Foundry Code Repository

I would like to define the following pandas_udf in a Palantir Foundry code repository.

@pandas_udf("long", PandasUDFType.GROUPED_AGG)
def percentile_95_udf(v):
    return v.quantile(0.95)

But when I try to define this udf in the global scope, I get the error:

AttributeError: 'NoneType' object has no attribute '_jvm'

However, if I define this same function within a function called by my transform, the code runs fine, as in:

from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from transforms.api import transform, Input, Output

@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(percentile_95(data_in.dataframe()))

def percentile_95(df):
    @pandas_udf("long", PandasUDFType.GROUPED_AGG)
    def percentile_95_udf(v):
        return v.quantile(0.95)

    # group rows for each interface into 1 day periods
    grp_by = df.groupBy(df.objectId, F.window("TimeCaptured", "1 day"))

    stats = [
        percentile_95_udf(df.ReceivedWidgets),
        percentile_95_udf(df.TransmittedWidgets),
    ]
    result = grp_by.agg(*stats)

    cleaned = result.withColumn("Day", F.col("window").start).drop("window")
    return cleaned

Why does my pandas_udf not work in global scope but work when defined within another function? Also, is there a better approach to defining a pandas_udf? Defining it as a nested function prevents me from reusing the UDF elsewhere.

For reference, my code repository in Palantir Foundry has the following structure:

transforms-python
    conda_recipe
        meta.yaml
    src
        myproject
            datasets
                __init__.py
                percentile_95.py
            __init__.py
            pipeline.py
        setup.cfg
        setup.py

The root cause here is similar to this question: PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

When you make calls at the global (module) level, you are trying to execute Spark commands (via pandas_udf, in your case) before Spark is set up. When you make the call inside the transform, Spark is already available, so it works.

The main problem is that the decorator itself runs when the module is imported, while Spark is only set up when the transform runs. When you call it from within def percentile_95(df):, you are actually applying the decorator from within a transform:

@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(
                             percentile_95(  # <-- here we're inside a transform
                                           data_in.dataframe())) 
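Put differently, the @pandas_udf line is just syntactic sugar for an ordinary function call. A minimal sketch of what happens at module (global) scope, where this call runs at import time, before Spark is available:

# Equivalent to the decorator form; creating the UDF needs an active Spark
# session under the hood, so at module scope it fails with the '_jvm' error.
percentile_95_udf = pandas_udf("long", PandasUDFType.GROUPED_AGG)(
    lambda v: v.quantile(0.95)
)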

If you want to reuse these UDFs in multiple places, you could wrap them in a function or a class that you initialise inside each transform where you want to use them, as sketched below.
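For example, a minimal sketch of the factory-function approach (the module name udfs.py and the function make_percentile_95_udf are illustrative, not part of the original repository):

# myproject/datasets/udfs.py -- hypothetical shared module
from pyspark.sql.functions import pandas_udf, PandasUDFType

def make_percentile_95_udf():
    # The decorator only runs when this factory is called, i.e. inside a
    # transform, after Spark has been set up.
    @pandas_udf("long", PandasUDFType.GROUPED_AGG)
    def percentile_95_udf(v):
        return v.quantile(0.95)
    return percentile_95_udf

Each transform that needs the UDF then builds it at run time:

from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from myproject.datasets.udfs import make_percentile_95_udf

@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    percentile_95_udf = make_percentile_95_udf()  # created inside the transform
    df = data_in.dataframe()
    grp_by = df.groupBy(df.objectId, F.window("TimeCaptured", "1 day"))
    result = grp_by.agg(
        percentile_95_udf(df.ReceivedWidgets),
        percentile_95_udf(df.TransmittedWidgets),
    )
    cleaned = result.withColumn("Day", F.col("window").start).drop("window")
    data_out.write_dataframe(cleaned)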
