Palantir Foundry incremental testing is hard to iterate on, how do I find bugs faster?

I have a pipeline set up in my Foundry instance that is using incremental computation, but for some reason it isn't doing what I expect. Namely, I want to read the previous output of my transform and get the maximum value of a date, then read the input only for data immediately after this maximum date.

For some reason it isn't doing what I expect, and it's quite frustrating to step through the code in a build / analyze / modify cycle.

My code looks like the following:

from pyspark.sql import functions as F, types as T, DataFrame
from transforms.api import transform, Input, Output, incremental
from datetime import date, timedelta


JUMP_DAYS = 1
START_DATE = date(year=2021, month=10, day=1)
OUTPUT_SCHEMA = T.StructType([
  T.StructField("date", T.DateType()),
  T.StructField("value", T.IntegerType())
])


@incremental(semantic_version=1)
@transform(
    my_input=Input("/path/to/my/input"),
    my_output=Output("/path/to/my/output")
)
def only_write_one_day(my_input, my_output):
  """Filter the input to only rows that are a day after the last written output and process them"""

  # Get the previous output and full current input
  previous_output_df = my_output.dataframe("previous", OUTPUT_SCHEMA)
  current_input_df = my_input.dataframe("current")

  # Get the next date of interest from the previous output 
  previous_max_date_rows = previous_output_df.groupBy().agg(
      F.max(F.col("date")).alias("max_date")
  ).collect() # noqa
  # PERFORMANCE NOTE: collecting the single max value here is acceptable; it avoids an
  #   expensive cross-join + filter in favor of starting a new query plan.

  if len(previous_max_date_rows) == 0:
    # We are running for the first time or re-snapshotting.  There's no previous date.  Use fallback.  
    previous_max_date = START_DATE
  else:
    # We have a previous max date, use it. 
    previous_max_date = previous_max_date_rows[0][0]

  delta = timedelta(days=JUMP_DAYS)
  next_date = previous_max_date + delta

  # Filter the input to only the next date
  filtered_input = current_input_df.filter(F.col("date") == F.lit(date))

  # Do any other processing...

  output_df = filtered_input

  # Persist 
  my_output.set_mode("modify")
  my_output.write_dataframe(output_df)

In incremental transforms, it can be difficult to isolate which conditions are breaking your code. As such, it's typically best to:

  1. Make your compute function do nothing besides fetch the appropriate views of your inputs / outputs and pass these DataFrames off to interior methods
  2. Modularize every piece of your logic to make it testable
  3. Write tests for each piece that validate that each manipulation of a specific DataFrame does what you expect

In your code example, breaking the execution up into a set of testable methods will make it substantially easier to test and see what's wrong.

The new method should look something like this:

from pyspark.sql import functions as F, types as T, DataFrame
from transforms.api import transform, Input, Output, incremental
from datetime import date, timedelta


JUMP_DAYS = 1
START_DATE = date(year=2021, month=10, day=1)
OUTPUT_SCHEMA = T.StructType([
  T.StructField("date", T.DateType()),
  T.StructField("value", T.IntegerType())
])


def get_previous_max_date(previous_output_df) -> date:
  """Given the previous output, get the maximum date written to it"""
  previous_max_date_rows = previous_output_df.groupBy().agg(
      F.max(F.col("date")).alias("max_date")
  ).collect() # noqa
  # PERFORMANCE NOTE: collecting the single max value here is acceptable; it avoids an
  #   expensive cross-join + filter in favor of starting a new query plan.

  if len(previous_max_date_rows) == 0 or previous_max_date_rows[0][0] is None:
    # First run or re-snapshot: a global aggregate over an empty previous output
    # yields a single row with a null max, so fall back to the start date.
    previous_max_date = START_DATE
  else:
    # We have a previous max date, use it.
    previous_max_date = previous_max_date_rows[0][0]
  return previous_max_date


def get_next_date(previous_output_df) -> date:
  """Given the previous output, compute the max date + 1 day"""
  previous_max_date = get_previous_max_date(previous_output_df)
  delta = timedelta(days=JUMP_DAYS)
  next_date = previous_max_date + delta
  return next_date


def filter_input_to_date(current_input_df: DataFrame, date_filter: date) -> DataFrame:
  """Given the entire intput, filter to only rows that have the next date"""
  return current_input_df.filter(F.col("date") == F.lit(date))


def process_with_dfs(current_input_df, previous_output_df) -> DataFrame:
  """With the constructed DataFrames, do our work"""
  # Get the next date of interest from the previous output 
  next_date = get_next_date(previous_output_df)

  # Filter the input to only the next date
  filtered_input = filter_input_to_date(current_input_df, next_date)

  # Do any other processing...

  return filtered_input


@incremental(semantic_version=1)
@transform(
    my_input=Input("/path/to/my/input"),
    my_output=Output("/path/to/my/output")
)
def only_write_one_day(my_input, my_output):
  """Filter the input to only rows that are a day after the last written output and process them"""

  # Get the previous output and full current input
  previous_output_df = my_output.dataframe("previous", OUTPUT_SCHEMA)
  current_input_df = my_input.dataframe("current")

  # Do the processing
  output_df = process_with_dfs(current_input_df, previous_output_df)

  # Persist 
  my_output.set_mode("modify")
  my_output.write_dataframe(output_df)

You can now set up individual unit tests. Assuming your code lives at transforms-python/src/myproject/datasets/output.py, follow the documented methodology for Python unit tests in Code Repositories to set everything up correctly.
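
The fixtures below rely on a spark_session pytest fixture. If your repository's test setup does not already provide one, a minimal local-Spark conftest.py along these lines would make it available. This is a sketch under assumptions: the file location transforms-python/src/myproject/tests/conftest.py is hypothetical, and Foundry's own test runner may already supply an equivalent fixture.

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    # Hypothetical stand-in for the fixture Foundry's test infrastructure
    # provides: a local Spark session shared across the test session.
    return (
        SparkSession.builder
        .master("local[1]")
        .appName("incremental-transform-tests")
        .getOrCreate()
    )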

Therefore my testing file now looks like the following:

from pyspark.sql import functions as F, types as T
from myproject.datasets.output import (
    only_write_one_day,
    process_with_dfs,
    filter_input_to_date,
    get_next_date,
    get_previous_max_date,
    OUTPUT_SCHEMA,
    JUMP_DAYS,
    START_DATE
)
import pytest
from datetime import date


@pytest.fixture
def empty_output_df(spark_session):
    data = []
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def single_write_output_df(spark_session):
    data = [{
        "date": date(year=2021, month=10, day=1),
        "value": 1
    }]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def double_write_output_df(spark_session):
    data = [
        {
            "date": date(year=2021, month=10, day=1),
            "value": 1
        },
        {
            "date": date(year=2021, month=10, day=2),
            "value": 2
        }
    ]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def normal_input_df(spark_session):
    data = [
        {
            "date": date(year=2021, month=10, day=1),
            "value": 1
        },
        {
            "date": date(year=2021, month=10, day=2),
            "value": 2
        }
    ]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


# ======= FIRST RUN CASE

def test_first_run_process_with_dfs(normal_input_df, empty_output_df):
    assert True


def test_first_run_filter_input_to_date(normal_input_df, empty_output_df):
    assert True


def test_first_run_get_next_date(normal_input_df, empty_output_df):
    assert True


def test_first_run_get_previous_max_date(normal_input_df, empty_output_df):
    assert True


# ======= NORMAL CASE

def test_normal_run_process_with_dfs(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_filter_input_to_date(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_get_next_date(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_get_previous_max_date(normal_input_df, single_write_output_df):
    assert True
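
These assert True bodies are placeholders that only prove the fixtures wire up. As a sketch of what the filled-in assertions could look like (replacing the placeholders above, using only the helpers and constants already defined), consider:

def test_first_run_get_previous_max_date(normal_input_df, empty_output_df):
    # With no previous output, the helper should fall back to START_DATE
    assert get_previous_max_date(empty_output_df) == START_DATE


def test_normal_run_get_next_date(normal_input_df, single_write_output_df):
    # The next date of interest is the previous max date plus JUMP_DAYS (1 day)
    assert get_next_date(single_write_output_df) == date(year=2021, month=10, day=2)


def test_normal_run_filter_input_to_date(normal_input_df, single_write_output_df):
    # Only the 2021-10-02 row of the input should survive the filter
    filtered = filter_input_to_date(normal_input_df, date(year=2021, month=10, day=2))
    assert filtered.count() == 1
    assert filtered.collect()[0]["value"] == 2

A test like the last one is exactly the kind that would have caught the original F.lit(date) mistake, where the filter compared against the imported date class instead of the date argument.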

It's worth noting that this is why you can enable things like McCabe complexity checkers and unit test coverage features inside Foundry: they encourage you to break your code up into smaller, more durable pieces like this.
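
For illustration, the McCabe check is exposed through flake8's standard max-complexity setting; a hypothetical lint configuration (the exact file and wiring inside a Foundry repository may differ) would look like:

[flake8]
# Fail linting when any function's cyclomatic (McCabe) complexity exceeds 10
max-complexity = 10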

Following a design pattern like this will give you much more durable code that is more trustworthy in incremental transforms.

If you adopt this style of transform, you will also be able to iterate much faster on perfecting your logic by running just the individual test you care about with the Code Repositories "Test" feature. You can open the test file and click the green "Test" button next to the specific case you are interested in, which will let you get your logic written much faster than clicking build every time and trying to line up your input conditions the way you want.
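
The same selectivity is available from any environment with plain pytest, via its standard -k keyword filter (the path here assumes the hypothetical repository layout above):

pytest transforms-python/src/myproject/tests -k test_normal_run_get_next_date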
