In Palantir Foundry, how do I debug PySpark (or pandas) UDFs since I can't use print statements?
In Code Workbooks, I can use print statements, which appear in the Output section of the Code Workbook (where errors normally appear). This does not work for UDFs, and it also doesn't work in Code Authoring/Repositories.

What are ways I can debug my PySpark code, especially if I'm using UDFs?
I will explain three debugging tools for PySpark, all usable in Foundry:
The easiest, quickest way to view a variable, especially for pandas UDFs, is to raise an exception.
def my_compute_function(my_input):
    interesting_variable = some_function(my_input)  # want to see the result of this
    raise ValueError(interesting_variable)
This is often easier than writing intermediate values out to a DataFrame and inspecting them. The downside is that it stops the execution of the code.
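As a minimal sketch of the trick without Spark (the function and variable names here are placeholders, not Foundry APIs): put the value you want to see into the exception message, and it surfaces where errors normally appear.

```python
import pandas as pd

# Placeholder for the body of a pandas UDF; in Foundry you would wrap
# this with F.pandas_udf(...), but the raise-to-inspect pattern is the same.
def debug_udf(s: pd.Series) -> pd.Series:
    summary = s.describe().to_string()  # the value we want to inspect
    # Raising fails the build, but the summary lands in the error output.
    raise ValueError(f"Input series summary:\n{summary}")
```

The build fails, but the error output now contains the description of the data the UDF actually received.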
If you are more experienced with pandas, you can take a small sample of the data and run your algorithm on the driver as a pandas Series or DataFrame, where you can debug it directly.
One technique I have used is not just downsampling the data to a number of rows, but filtering the data so it is representative of my work. For example, if I were writing an algorithm to determine flight delays, I would filter to all flights to a specific airport on a specific day. This way I'm testing holistically on the sample.
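A sketch of that filtering approach, shown here in plain pandas with made-up toy data (in Foundry you would apply the same filters to the Spark DataFrame and call `.toPandas()` to bring the slice onto the driver):

```python
import pandas as pd

# Toy stand-in for the flights dataset; column names are assumptions.
flights = pd.DataFrame({
    "dest_airport":  ["SFO", "JFK", "SFO"],
    "flight_date":   ["2021-06-01", "2021-06-01", "2021-06-02"],
    "delay_minutes": [12, 45, 3],
})

# Filter to a representative slice (one airport, one day) rather than
# taking an arbitrary N rows.
sample = flights[
    (flights["dest_airport"] == "SFO")
    & (flights["flight_date"] == "2021-06-01")
]

# Now the algorithm can be debugged on the driver with ordinary pandas.
mean_delay = sample["delay_minutes"].mean()
```

Because the slice is semantically coherent, bugs in the algorithm show up the same way they would on the full dataset.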
Code Repositories uses Python's built-in logging library. This is widely documented online and allows you to control the logging level (ERROR, WARNING, INFO) for easier filtering.
Logging output appears both in your output dataset's log files and in your build's driver logs (Dataset -> Details -> Files -> Log Files, and Builds -> Build -> Job status logs -> select "Driver logs", respectively).
This allows you to view the logged information in the logs after the build completes, but it doesn't work for UDFs.
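A minimal sketch of driver-side logging in a transform (`some_input` is a placeholder for the input DataFrame, and the repository's default logging configuration is assumed):

```python
import logging

logger = logging.getLogger(__name__)

def some_transformation(some_input):
    # Messages at the chosen level and above appear in the build's
    # driver logs, where they can be filtered by level.
    logger.info("row count before transformation: %d", some_input.count())
    return some_input
```

This works because the logger call runs in the driver process; as explained next, the same call inside a UDF would run on an executor and its output would not be captured.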
The work done by a UDF runs on the executors, not the driver, and Spark captures logging output only from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture, and store it in a column to view once the build is finished:
import logging

from pyspark.sql import functions as F
from transforms.api import transform_df

logger = logging.getLogger(__name__)

@transform_df(
    ...
)
def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
    return df