
In Palantir Foundry, how do I debug pyspark (or pandas) UDFs since I can't use print statements?

In Code Workbooks, I can use print statements, which appear in the Output section of the Code Workbook (where errors normally appear). This does not work for UDFs, and it also doesn't work in Code Authoring/Repositories.

What are ways I can debug my pyspark code, especially if I'm using UDFs?

I will explain three debugging tools for pyspark (all usable in Foundry):

  1. Raising Exceptions
  2. Running locally as a pandas series
  3. Logging and specifically logging in UDFs

Raising Exceptions

The easiest, quickest way to view a variable, especially for a pandas UDF, is to raise an exception:

def my_compute_function(my_input):
    interesting_variable = some_function(my_input)  # Want to see result of this
    raise ValueError(interesting_variable)

This is often easier than reading/writing DataFrames because:

  1. Can easily insert a raise statement without messing with the transform's return value or other logic
  2. Don't need to mess around with defining a valid schema for your debug statement

The exception's message, containing your variable's value, then appears where errors normally do (for example, the Output section of a Code Workbook). The downside is that it stops the execution of the code.

Running locally as a pandas series

If you are more experienced with pandas, you can take a small sample of the data and run your algorithm on the driver as a pandas Series, where you can do the debugging.

One technique I have used is to not just downsample the data by a number of rows, but to filter the data so it is representative of my work. For example, if I were writing an algorithm to determine flight delays, I would filter to all flights to a specific airport on a specific day. This way I'm testing holistically on the sample.
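
As an illustration, here is a minimal sketch of that workflow, written in the same Code Workbook style as the earlier example. The dataset, the column names (dest_airport, flight_date, scheduled_departure, actual_departure), and the compute_delay_minutes helper are hypothetical placeholders, not something from the original question:

import pyspark.sql.functions as F


def compute_delay_minutes(scheduled, actual):
    # The same per-row logic you would eventually wrap in a pandas UDF;
    # here it operates on plain pandas Series.
    return (actual - scheduled).dt.total_seconds() / 60


def my_compute_function(flights):
    # Filter to a small, representative slice rather than a blind row limit.
    sample = (
        flights
        .filter(F.col("dest_airport") == "SFO")
        .filter(F.col("flight_date") == "2021-06-01")
    )

    # Pull the sample onto the driver as a pandas DataFrame.
    local = sample.toPandas()

    # Run and inspect the logic as ordinary pandas code on the driver,
    # e.g. by raising an exception with the intermediate result.
    delays = compute_delay_minutes(
        local["scheduled_departure"], local["actual_departure"]
    )
    raise ValueError(delays.describe())

Because everything after toPandas() is plain pandas running on the driver, you can raise exceptions (or step through the logic) to see exactly what the UDF logic would do on that representative sample.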

Logging

Code Repositories uses Python's built-in logging library. This is widely documented online and allows you to control the logging level (ERROR, WARNING, INFO) for easier filtering.

Logging output appears both in your output dataset's log files and in your build's driver logs (Dataset -> Details -> Files -> Log Files, and Builds -> Build -> Job status logs; select "Driver logs", respectively).
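
To make this concrete, here is a minimal sketch of driver-level logging in a Code Repositories transform; the input and output dataset paths are hypothetical placeholders:

import logging

from transforms.api import transform_df, Input, Output

logger = logging.getLogger(__name__)


@transform_df(
    Output("/examples/debugging/output_dataset"),
    source_df=Input("/examples/debugging/input_dataset"),
)
def my_compute_function(source_df):
    # These messages run on the driver, so they show up in the driver logs
    # and in the output dataset's log files once the build completes.
    logger.info("Input has %d columns", len(source_df.columns))
    logger.warning("Use WARNING/ERROR levels for easier filtering")
    return source_df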

This lets you view the logged information once the build completes, but it doesn't work inside UDFs.

Logging in UDFs

The work done by a UDF is done by the executors, not the driver, and Spark only captures logging output from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture, and store it in a column to view once the build is finished:

import logging

from pyspark.sql import functions as F
from transforms.api import transform_df

logger = logging.getLogger(__name__)


@transform_df(
    ...
)
def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    # A second UDF whose only job is to surface the value you want to inspect.
    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
    return df
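
Once the build finishes, you can inspect the debugging column in the output dataset to see the captured value for each row, without halting execution the way a raised exception would.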
