
What is the proper way to define a Pandas UDF in a Palantir Foundry Code Repository

I would like to define the following pandas_udf in a Palantir Foundry code repository.

@pandas_udf("long", PandasUDFType.GROUPED_AGG)
def percentile_95_udf(v):
    return v.quantile(0.95)

But when I try to define this udf in the global scope, I get the error:

AttributeError: 'NoneType' object has no attribute '_jvm'

However, if I define this same function within a function called by my transform, the code runs fine, as in:

from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from transforms.api import transform, Input, Output

@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(percentile_95(data_in.dataframe()))

def percentile_95(df):
    @pandas_udf("long", PandasUDFType.GROUPED_AGG)
    def percentile_95_udf(v):
        return v.quantile(0.95)

    # group rows for each interface into 1 day periods
    grp_by = df.groupBy(df.objectId, F.window("TimeCaptured", "1 day"))

    stats = [
        percentile_95_udf(df.ReceivedWidgets),
        percentile_95_udf(df.TransmittedWidgets),
    ]
    result = grp_by.agg(*stats)

    cleaned = result.withColumn("Day", F.col("window").start).drop("window")
    return cleaned

Why does my pandas_udf not work in global scope but does work when defined within another function? Also, is there a better approach to defining a pandas_udf? Defining it as a nested function prevents me from reusing my UDF.

For reference, my code repository in Palantir Foundry has the following structure:

transforms-python
    conda_recipe
        meta.yaml
    src
        myproject
            datasets
                __init__.py
                percentile_95.py
            __init__.py
            pipeline.py
        setup.cfg
        setup.py

The root cause is similar to the one in this question: PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

When you make calls at the global (module) level, you are trying to execute Spark commands (via pandas, in your case) before Spark has been set up. When you make the call inside the transform, Spark is available, so it works.

The main problem is that the decorator (annotation) itself is evaluated at the top level, while Spark is only set up when the transform runs. When you call it from within def percentile_95(df):, you are actually applying the decorator from inside a transform:

@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    data_out.write_dataframe(
                             percentile_95(  # <-- here we're inside a transform
                                           data_in.dataframe())) 

If you want to reuse these UDFs in multiple places, you could wrap them in a function or a class that you initialise inside each transform where you need them, as in the sketch below.
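For example, here is a minimal sketch of the wrap-it-in-a-function approach. The module name myproject/datasets/udfs.py, the helper name make_percentile_95_udf, and the way it is reused inside the transform are illustrative assumptions, not part of the original post:

# myproject/datasets/udfs.py -- hypothetical shared module
from pyspark.sql.functions import pandas_udf, PandasUDFType

def make_percentile_95_udf():
    # Build the grouped-aggregate UDF lazily, so nothing touches Spark at import time.
    @pandas_udf("long", PandasUDFType.GROUPED_AGG)
    def percentile_95_udf(v):
        return v.quantile(0.95)
    return percentile_95_udf

# myproject/datasets/percentile_95.py
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from myproject.datasets.udfs import make_percentile_95_udf

@transform(
    data_out=Output("output path"),
    data_in=Input("input path")
)
def percentile_95_transform(data_in, data_out):
    df = data_in.dataframe()
    # The factory is only called here, inside the transform, after Spark is available.
    percentile_95_udf = make_percentile_95_udf()

    # group rows for each interface into 1 day periods
    grp_by = df.groupBy(df.objectId, F.window("TimeCaptured", "1 day"))
    result = grp_by.agg(
        percentile_95_udf(df.ReceivedWidgets),
        percentile_95_udf(df.TransmittedWidgets),
    )
    data_out.write_dataframe(
        result.withColumn("Day", F.col("window").start).drop("window")
    )

Any other transform can import make_percentile_95_udf the same way and call it inside its own transform function, which keeps the UDF reusable while avoiding evaluation of pandas_udf at import time.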
