Palantir Foundry incremental transforms are hard to iterate on; how do I find bugs faster?
I have a pipeline in my Foundry instance that uses incremental computation, but for some reason it isn't doing what I expect. Specifically, I want to read the previous output of my transform, take the maximum of its date column, and then read from the input only the data immediately after that maximum date.

For some reason it isn't behaving that way, and stepping through the code in a build / analyze / modify loop is very frustrating.

My code looks like this:
```python
from pyspark.sql import functions as F, types as T, DataFrame
from transforms.api import transform, Input, Output, incremental
from datetime import date, timedelta

JUMP_DAYS = 1
START_DATE = date(year=2021, month=10, day=1)
OUTPUT_SCHEMA = T.StructType([
    T.StructField("date", T.DateType()),
    T.StructField("value", T.IntegerType())
])


@incremental(semantic_version=1)
@transform(
    my_input=Input("/path/to/my/input"),
    my_output=Output("/path/to/my/output")
)
def only_write_one_day(my_input, my_output):
    """Filter the input to only rows that are a day after the last written output and process them"""
    # Get the previous output and full current input
    previous_output_df = my_output.dataframe("previous", OUTPUT_SCHEMA)
    current_input_df = my_input.dataframe("current")

    # Get the next date of interest from the previous output
    previous_max_date_rows = previous_output_df.groupBy().agg(
        F.max(F.col("date")).alias("max_date")
    ).collect()  # noqa
    # PERFORMANCE NOTE: It is acceptable to collect the max value here to avoid an expensive
    # cross-join-filter operation in favor of making a new query plan.
    if len(previous_max_date_rows) == 0:
        # We are running for the first time or re-snapshotting. There's no previous date. Use the fallback.
        previous_max_date = START_DATE
    else:
        # We have a previous max date, use it.
        previous_max_date = previous_max_date_rows[0][0]

    delta = timedelta(days=JUMP_DAYS)
    next_date = previous_max_date + delta

    # Filter the input to only the next date
    filtered_input = current_input_df.filter(F.col("date") == F.lit(date))

    # Do any other processing...
    output_df = filtered_input

    # Persist
    my_output.set_mode("modify")
    my_output.write_dataframe(output_df)
```
In an incremental transform it is hard to tell which of the possible input/output conditions is the one breaking your code, so it is generally best to keep the `@transform` function itself as thin as possible. In your code sample, breaking the execution down into a set of individually testable methods will make it much easier to test each piece and to spot the problem.

The new methods would look like this:
```python
from pyspark.sql import functions as F, types as T, DataFrame
from transforms.api import transform, Input, Output, incremental
from datetime import date, timedelta

JUMP_DAYS = 1
START_DATE = date(year=2021, month=10, day=1)
OUTPUT_SCHEMA = T.StructType([
    T.StructField("date", T.DateType()),
    T.StructField("value", T.IntegerType())
])


def get_previous_max_date(previous_output_df) -> date:
    """Given the previous output, get the maximum date written to it"""
    previous_max_date_rows = previous_output_df.groupBy().agg(
        F.max(F.col("date")).alias("max_date")
    ).collect()  # noqa
    # PERFORMANCE NOTE: It is acceptable to collect the max value here to avoid an expensive
    # cross-join-filter operation in favor of making a new query plan.
    if len(previous_max_date_rows) == 0:
        # We are running for the first time or re-snapshotting. There's no previous date. Use the fallback.
        previous_max_date = START_DATE
    else:
        # We have a previous max date, use it.
        previous_max_date = previous_max_date_rows[0][0]
    return previous_max_date


def get_next_date(previous_output_df) -> date:
    """Given the previous output, compute the max date + 1 day"""
    previous_max_date = get_previous_max_date(previous_output_df)
    delta = timedelta(days=JUMP_DAYS)
    next_date = previous_max_date + delta
    return next_date


def filter_input_to_date(current_input_df: DataFrame, date_filter: date) -> DataFrame:
    """Given the entire input, filter to only rows that have the next date"""
    return current_input_df.filter(F.col("date") == F.lit(date_filter))


def process_with_dfs(current_input_df, previous_output_df) -> DataFrame:
    """With the constructed DataFrames, do our work"""
    # Get the next date of interest from the previous output
    next_date = get_next_date(previous_output_df)
    # Filter the input to only the next date
    filtered_input = filter_input_to_date(current_input_df, next_date)
    # Do any other processing...
    return filtered_input


@incremental(semantic_version=1)
@transform(
    my_input=Input("/path/to/my/input"),
    my_output=Output("/path/to/my/output")
)
def only_write_one_day(my_input, my_output):
    """Filter the input to only rows that are a day after the last written output and process them"""
    # Get the previous output and full current input
    previous_output_df = my_output.dataframe("previous", OUTPUT_SCHEMA)
    current_input_df = my_input.dataframe("current")
    # Do the processing
    output_df = process_with_dfs(current_input_df, previous_output_df)
    # Persist
    my_output.set_mode("modify")
    my_output.write_dataframe(output_df)
```
You can now set up individual unit tests. Assuming your code lives at transforms-python/src/myproject/datasets/output.py, follow the documented test-setup methodology to wire everything up correctly.
My test file then looks like this:
```python
from pyspark.sql import functions as F, types as T
from myproject.datasets import (
    only_write_one_day,
    process_with_dfs,
    filter_input_to_date,
    get_next_date,
    get_previous_max_date,
    OUTPUT_SCHEMA,
    JUMP_DAYS,
    START_DATE
)
import pytest
from datetime import date


@pytest.fixture
def empty_output_df(spark_session):
    data = []
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def single_write_output_df(spark_session):
    data = [{
        "date": date(year=2021, month=10, day=1),
        "value": 1
    }]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def double_write_output_df(spark_session):
    data = [
        {"date": date(year=2021, month=10, day=1), "value": 1},
        {"date": date(year=2021, month=10, day=2), "value": 2}
    ]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


@pytest.fixture
def normal_input_df(spark_session):
    data = [
        {"date": date(year=2021, month=10, day=1), "value": 1},
        {"date": date(year=2021, month=10, day=2), "value": 2}
    ]
    return spark_session.createDataFrame(data, OUTPUT_SCHEMA)


# ======= FIRST RUN CASE

def test_first_run_process_with_dfs(normal_input_df, empty_output_df):
    assert True  # placeholder: replace with real assertions on the output


def test_first_run_filter_input_to_date(normal_input_df, empty_output_df):
    assert True


def test_first_run_get_next_date(normal_input_df, empty_output_df):
    assert True


def test_first_run_get_previous_max_date(normal_input_df, empty_output_df):
    assert True


# ======= NORMAL CASE

def test_normal_run_process_with_dfs(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_filter_input_to_date(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_get_next_date(normal_input_df, single_write_output_df):
    assert True


def test_normal_run_get_previous_max_date(normal_input_df, single_write_output_df):
    assert True
```
Notably, this is also why Foundry lets you enable features such as the McCabe complexity checker and unit-test coverage: they push you to break your code into smaller, more durable pieces like these. Following a design pattern of this kind gives you longer-lived code that is far more trustworthy inside incremental transforms.

If you adopt this style of transform, you can also iterate much faster by running individual tests with your code repository's "Test" feature while you refine your logic. Open the test file and click the green "Test" button next to the specific case you care about; this lets you write your logic far faster than clicking Build every time and trying to arrange the input conditions you want.
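As a concrete illustration of how the stubs can be filled in, note that once the Spark aggregation is isolated in `get_previous_max_date`, the remaining date-stepping logic is plain Python and can be checked without a Spark session at all. The helper below is a hypothetical reduction of that logic (it is not part of the code above), shown only to illustrate the shape such assertions can take:

```python
from datetime import date, timedelta

JUMP_DAYS = 1
START_DATE = date(year=2021, month=10, day=1)


def next_date_from(previous_max_date):
    """Pure-Python core of get_next_date: fall back to START_DATE when
    there is no previous output, then step forward by JUMP_DAYS."""
    base = previous_max_date if previous_max_date is not None else START_DATE
    return base + timedelta(days=JUMP_DAYS)


# First-run case: no previous output, so the fallback puts the first
# processed date at START_DATE + 1 day (START_DATE itself is skipped).
assert next_date_from(None) == date(2021, 10, 2)
# Normal case: step one day past the previous maximum.
assert next_date_from(date(2021, 10, 5)) == date(2021, 10, 6)
```

Tests like these run instantly under the "Test" button, so edge cases such as the first-run fallback can be pinned down before ever triggering a build.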