I would like to define the following pandas_udf in a Palantir Foundry code repository.
@pandas_udf("long", PandasUDFType.GROUPED_AGG)
def percentile_95_udf(v):
    return v.quantile(0.95)
But when I try to define this UDF at global scope, I get the error:
AttributeError: 'NoneType' object has no attribute '_jvm'
However, if I define the same function within a function called by my transform, the code runs fine, as in:
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from transforms.api import transform, Input, Output
@transform(
    data_out=Output("output path"),
    data_in=Input("input path"),
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(percentile_95(data_in.dataframe()))
def percentile_95(df):
    @pandas_udf("long", PandasUDFType.GROUPED_AGG)
    def percentile_95_udf(v):
        return v.quantile(0.95)

    # group rows for each interface into 1 day periods
    grp_by = df.groupBy(df.objectId, F.window("TimeCaptured", "1 day"))
    stats = [
        percentile_95_udf(df.ReceivedWidgets),
        percentile_95_udf(df.TransmittedWidgets),
    ]
    result = grp_by.agg(*stats)
    cleaned = result.withColumn("Day", F.col("window").start).drop("window")
    return cleaned
Why does my pandas_udf not work at global scope but work when defined within another function? Also, is there a better approach to defining a pandas_udf? Defining it as a nested function prevents me from reusing the UDF elsewhere.
For reference, my code repository in Palantir Foundry has the following structure:
transforms-python
    conda_recipe
        meta.yaml
    src
        myproject
            datasets
                __init__.py
                percentile_95.py
            __init__.py
            pipeline.py
        setup.cfg
        setup.py
The root cause is similar to this question: PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'
When you make calls at global (module) level, you are trying to execute Spark commands (via pandas, in your case) before Spark is set up. When you make the call inside the transform, Spark is available, so it works.
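This timing is ordinary Python behaviour, not something Foundry-specific: a decorator executes as soon as the module is imported, whereas a decorator inside a nested `def` only executes when the enclosing function is called. A Spark-free sketch (the `tracing_decorator` name is just illustrative):

```python
calls = []

def tracing_decorator(fn):
    # This body runs at decoration time -- i.e. at module import
    # for top-level functions.
    calls.append(f"decorating {fn.__name__}")
    return fn

@tracing_decorator  # executes immediately when the module is imported
def top_level():
    pass

def build():
    # Nothing in here runs until build() is actually called.
    @tracing_decorator
    def nested():
        pass
    return nested
```

Right after import, `calls` contains only `"decorating top_level"`; `"decorating nested"` appears only once `build()` is invoked. `@pandas_udf(...)` behaves the same way, except that its decoration step needs a live Spark session.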
The main problem is that the `@pandas_udf` annotation itself executes at top level, when the module is imported, while Spark is only set up once the transform runs. When you call it from within `def percentile_95(df):`, the annotation actually executes inside a transform:
@transform(
    data_out=Output("output path"),
    data_in=Input("input path"),
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(
        percentile_95(  # <-- here we're inside a transform
            data_in.dataframe()))
If you want to reuse these UDFs in multiple places, you could wrap them in a function or a class that you initialise inside each transform that needs them.