
Palantir Foundry incremental testing is hard to iterate on, how do I find bugs faster?

I have a pipeline set up in my Foundry instance that uses incremental computation, but for some reason it isn't doing what I expect. Namely, I want to read the previous output of my transform and get the maximum value of the date column, then only read input data immediately after that maximum date.

For some reason it isn't behaving the way I expect, and stepping through the code with the build / analyze / modify-code loop is very frustrating.

My code looks like this:

from pyspark.sql import functions as F, types as T, DataFrame
from transforms.api import transform, Input, Output, incremental
from datetime import date, timedelta


JUMP_DAYS = 1
START_DATE = date(year=2021, month=10, day=1)
OUTPUT_SCHEMA = T.StructType([
  T.StructField("date", T.DateType()),
  T.StructField("value", T.IntegerType())
])


@incremental(semantic_version=1)
@transform(
    my_input=Input("/path/to/my/input"),
    my_output=Output("/path/to/my/output")
)
def only_write_one_day(my_input, my_output):
  """Filter the input to only rows that are a day after the last written output and process them"""

  # Get the previous output and full current input
  previous_output_df = my_output.dataframe("previous", OUTPUT_SCHEMA)
  current_input_df = my_input.dataframe("current")

  # Get the next date of interest from the previous output 
  previous_max_date_rows = previous_output_df.groupBy().agg(
      F.max(F.col("date")).alias("max_date")
  ).collect() # noqa
  # PERFORMANCE NOTE: It is acceptable to collect the max value here to avoid cross-join-filter expensive
  #   operation in favor of making a new query plan. 

  if len(previous_max_date_rows) == 0:
    # We are running for the first time or re-snapshotting.  There's no previous date.  Use fallback.  
    previous_max_date = START_DATE
  else:
    # We have a previous max date, use it. 
    previous_max_date = previous_max_date_rows[0][0]

  delta = timedelta(days=JUMP_DAYS)
  next_date = previous_max_date + delta

  # Filter the input to only the next date
  filtered_input = current_input_df.filter(F.col("date") == F.lit(date))

  # Do any other processing...

  output_df = filtered_input

  # Persist 
  my_output.set_mode("modify")
  my_output.write_dataframe(output_df)

In incremental transforms it's hard to tell which combination of conditions is breaking your code, so it's typically best to:

  1. Have your compute function do nothing except fetch the appropriate views of your inputs/outputs and pass these DataFrames off to inner methods
  2. Modularize each piece of your logic so it is testable
  3. Write tests for each piece that verify every manipulation of a specific DataFrame does what you expect.

In your code example, breaking the execution up into a set of testable methods will make it much easier to test and spot the problem.

The new methodology would look like this:

from pyspark.sql import functions as F, types as T, DataFrame
from transforms.api import transform, Input, Output, incremental
from datetime import date, timedelta


JUMP_DAYS = 1
START_DATE = date(year=2021, month=10, day=1)
OUTPUT_SCHEMA = T.StructType([
  T.StructField("date", T.DateType()),
  T.StructField("value", T.IntegerType())
])


def get_previous_max_date(previous_output_df) -> date:
  """Given the previous output, get the maximum date written to it"""
  previous_max_date_rows = previous_output_df.groupBy().agg(
      F.max(F.col("date")).alias("max_date")
  ).collect() # noqa
  # PERFORMANCE NOTE: It is acceptable to collect the max value here to avoid cross-join-filter expensive
  #   operation in favor of making a new query plan. 

  if len(previous_max_date_rows) == 0:
    # We are running for the first time or re-snapshotting.  There's no previous date.  Use fallback.  
    previous_max_date = START_DATE
  else:
    # We have a previous max date, use it. 
    previous_max_date = previous_max_date_rows[0][0]
  return previous_max_date


def get_next_date(previous_output_df) -> date:
  """Given the previous output, compute the max date + 1 day"""
  previous_max_date = get_previous_max_date(previous_output_df)
  delta = timedelta(days=JUMP_DAYS)
  next_date = previous_max_date + delta
  return next_date


def filter_input_to_date(current_input_df: DataFrame, date_filter: date) -> DataFrame:
  """Given the entire input, filter to only rows that have the next date"""
  return current_input_df.filter(F.col("date") == F.lit(date))


def process_with_dfs(current_input_df, previous_output_df) -> DataFrame:
  """With the constructed DataFrames, do our work"""
  # Get the next date of interest from the previous output 
  next_date = get_next_date(previous_output_df)

  # Filter the input to only the next date
  filtered_input = filter_input_to_date(current_input_df, next_date)

  # Do any other processing...

  return filtered_input


@incremental(semantic_version=1)
@transform(
    my_input=Input("/path/to/my/input"),
    my_output=Output("/path/to/my/output")
)
def only_write_one_day(my_input, my_output):
  """Filter the input to only rows that are a day after the last written output and process them"""

  # Get the previous output and full current input
  previous_output_df = my_output.dataframe("previous", OUTPUT_SCHEMA)
  current_input_df = my_input.dataframe("current")

  # Do the processing
  output_df = process_with_dfs(current_input_df, previous_output_df)

  # Persist 
  my_output.set_mode("modify")
  my_output.write_dataframe(output_df)

You can now set up individual unit tests, assuming your code lives at transforms-python/src/myproject/datasets/output.py and you have followed the documented testing setup to put everything in the right place.

My test file now looks like this:

from pyspark.sql import functions as F, types as T
from myproject.datasets import (
    only_write_one_day,
    process_with_dfs,
    filter_input_to_date,
    get_next_date,
    get_previous_max_date,
    OUTPUT_SCHEMA,
    JUMP_DAYS,
    START_DATE
)
import pytest
from datetime import date


@pytest.fixture
def empty_output_df(spark_session):
    data = []
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def single_write_output_df(spark_session):
    data = [{
        "date": date(year=2021, month=10, day=1),
        "value": 1
    }]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def double_write_output_df(spark_session):
    data = [
        {
            "date": date(year=2021, month=10, day=1),
            "value": 1
        },
        {
            "date": date(year=2021, month=10, day=2),
            "value": 2
        }
    ]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def normal_input_df(spark_session):
    data = [
        {
            "date": date(year=2021, month=10, day=1),
            "value": 1
        },
        {
            "date": date(year=2021, month=10, day=2),
            "value": 2
        }
    ]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


# ======= FIRST RUN CASE

def test_first_run_process_with_dfs(normal_input_df, empty_output_df):
    assert True


def test_first_run_filter_input_to_date(normal_input_df, empty_output_df):
    assert True


def test_first_run_get_next_date(normal_input_df, empty_output_df):
    assert True


def test_first_run_get_previous_max_date(normal_input_df, empty_output_df):
    assert True


# ======= NORMAL CASE

def test_normal_run_process_with_dfs(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_filter_input_to_date(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_get_next_date(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_get_previous_max_date(normal_input_df, single_write_output_df):
    assert True
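The `assert True` bodies above are placeholders. As a minimal sketch of the kind of check they should grow into (shown without Spark: `pick_previous_max_date` is a hypothetical stand-in that applies `get_previous_max_date`'s branching to plain tuples in place of the rows `.collect()` returns):

```python
from datetime import date, timedelta

JUMP_DAYS = 1
START_DATE = date(2021, 10, 1)


def pick_previous_max_date(collected_rows):
    """Mirror of get_previous_max_date's branching, over already-collected rows."""
    if len(collected_rows) == 0:
        # First run or re-snapshot: no previous output exists, use the fallback
        return START_DATE
    # Normal run: the single aggregated row holds the previous max date
    return collected_rows[0][0]


# First-run case: an empty previous output falls back to START_DATE
assert pick_previous_max_date([]) == START_DATE

# Normal case: the collected max date is used, and the next date is one day later
previous_max = pick_previous_max_date([(date(2021, 10, 1),)])
assert previous_max + timedelta(days=JUMP_DAYS) == date(2021, 10, 2)
```

Once each placeholder makes assertions like these against the real helpers (using the `spark_session` fixture to build the input DataFrames), a failing case points directly at the broken piece of logic.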

It's worth noting that this is exactly why Foundry lets you enable features like the McCabe complexity checker and unit-test coverage: so that you can break your code up into smaller, more durable pieces like these.

Following a design pattern like this will give you more durable code that you can trust more in incremental transforms.

If you adopt this style of transform, you will also be able to iterate much faster on perfecting your logic by running individual tests with the code repository's 'Test' feature. You can open the test file and click the green 'Test' button next to the specific case you're interested in, which lets you get your logic written far faster than clicking build every time and trying to line up your input conditions the way you want.
