
How do I build a large incremental output dataset from an existing large incremental input dataset in Foundry?

I have an 80TB date-partitioned dataset in Palantir Foundry, which ingests 300-450GB of data in an incremental Append transaction every 3 hours. I want to create an incremental transform using this as an input.

However, the dataset is too large to read all at once for the initial snapshot, while the data appended every 3 hours would be small enough to process incrementally after that initial snapshot. How can I work through the backlog in the input dataset and reach the point where I can run my transform in incremental mode?

When reading from a large input dataset that has been constructed incrementally, it is not possible in Foundry to read from some subset of the input's transactions. You must either read the entire input dataset at once (snapshot mode), or read only the input transactions that have been written since the last time you built the output (incremental mode).
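
To make the two read modes concrete, here is a minimal sketch of an @incremental transform (the dataset paths are placeholders):

from transforms.api import transform, Input, Output, incremental


@incremental()
@transform(
    output=Output("/placeholder/output"),
    source=Input("/placeholder/input"),
)
def compute(source, output):
    # source.dataframe('current') -> snapshot-style read: the entire input as it currently stands
    # source.dataframe()          -> incremental read: only transactions appended since the last successful build
    added_df = source.dataframe()
    output.write_dataframe(added_df)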

In order to get around this, we have to get clever with parsing the input. Here is the transform:

from transforms.api import transform, Input, Output, incremental
from pyspark.sql import Row
from pyspark.sql import functions as F, types as T, SparkSession
import datetime


# set this value for the type of build:
# "first" for a snapshot run on a single date (sets placeholder_date, runs snapshot)
# "catchup" for subsequent runs on subsequent dates (reads from placeholder_date to decide what date to run, then runs update from full read)
# "continuing" for ongoing incremental runs
PHASE = 'first'

# Where data begins
START_DATE = datetime.date(2022, 7, 1)
# Where we want the automated rebuild process to stop.
# Set this value to less than the most recent date for reasons discussed in the accompanying post
END_DATE = datetime.date(2022, 7, 22)
DAYS_PER_RUN = 4  # How many days worth of data do we want each 'catchup' run to read

placeholder_date_schema = T.StructType([
    T.StructField("date", T.DateType(), True)
])


@incremental(semantic_version=3)
@transform(
    output=Output("output"),
    placeholder_date=Output("placeholder_date"),
    source=Input("input"),
)
def compute(source, output, placeholder_date):

    # First and Catchup Builds
    if PHASE in ('first', 'catchup'):
        df = source.dataframe('current')  # read the entire input dataset
    # Continuing Builds
    if PHASE == 'continuing':
        df = source.dataframe()  # read the latest incremental appends

    # First Build: Build placeholder_date initially
    if PHASE == 'first':
        spark = SparkSession.builder.getOrCreate()
        next_output_last_date = START_DATE + datetime.timedelta(days=(DAYS_PER_RUN-1))
        most_recent_output_date = START_DATE - datetime.timedelta(days=1)
        placeholder_date_df = spark.createDataFrame(data=[Row(next_output_last_date)], schema=placeholder_date_schema)

    # Catchup Builds: Use placeholder_date to get the previous starting time
    if PHASE == 'catchup':
        placeholder_date_df = placeholder_date.dataframe('previous', placeholder_date_schema)
        most_recent_output_date = placeholder_date_df.collect()[0][0]  # noqa
        next_output_last_date = most_recent_output_date + datetime.timedelta(days=DAYS_PER_RUN)
        # Ensure that the time window doesn't go past the end date by curtailing the period if necessary
        if next_output_last_date >= END_DATE:
            next_output_last_date = END_DATE
        # Ensure we don't run once we pass the end point
        if most_recent_output_date >= END_DATE:
            return  # complete the build without reading or writing any further data

        placeholder_date_df = placeholder_date_df.withColumn("date", F.lit(next_output_last_date))

    # First and Catchup Builds: Write the placeholder
    # It's safe to write the placeholder because if the build fails the placeholder transaction will also be aborted
    if PHASE in ('first', 'catchup'):
        placeholder_date.set_mode('replace')
        placeholder_date.write_dataframe(placeholder_date_df, output_format='csv')
        # Filter the whole input dataset
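        # The window is exclusive of most_recent_output_date and inclusive of next_output_last_date,
        # so consecutive runs cover contiguous, non-overlapping date ranges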
        df = df.where((F.col("date") > F.lit(most_recent_output_date)) & (F.col("date") <= F.lit(next_output_last_date)))

    # Transform the data as required
    df = transform_data(df)

    # Write the output
    output.write_dataframe(df, partition_cols=["date"])


# Define whatever transformations you want to perform here
def transform_data(df):
    return df

The transform has three "phases": first, catchup, and continuing. You run the transform once in the first phase, then as many times as necessary in the catchup phase until the entire existing input dataset has been parsed. Finally, once that's done, you switch it to the continuing phase and schedule it to run (incrementally) each time the input updates.

The build stores state in a placeholder_date dataset, which is created in the first build and read during catchup builds to determine how far the catchup process has progressed. catchup mode has an additional failsafe: it will not write out empty transactions if builds continue past the END_DATE. This allows you to set up a (force-building) schedule (e.g. every 10 minutes) during the catchup phase and simply leave it, coming back to check periodically, without having to time the end of the catchup phase carefully. Once the catchup phase is finished, you can set the transform to continuing mode and it will switch to fully incremental behavior.

Notes

In the sample code above, it helps to be working with an input dataset that is hive-partitioned by date, since this makes the filtering cheaper and easier. However, the approach still works (albeit much more slowly) without a hive-partitioned input. That said, hive-partitioning an incremental dataset by date is good practice, and the transform in this example partitions its output by date for ease of future use.
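
As a side note, you can sanity-check that a date filter is actually pushed down to partition pruning using plain Spark outside the transform (the path below is a placeholder):

from pyspark.sql import SparkSession, functions as F
import datetime

spark = SparkSession.builder.getOrCreate()
# Placeholder path to a date-hive-partitioned parquet dataset
df = spark.read.parquet("/placeholder/path/to/hive_partitioned_dataset")
# For a hive-partitioned input, this predicate appears under PartitionFilters
# in the physical plan, so only the matching date directories are scanned
df.where(F.col("date") > F.lit(datetime.date(2022, 7, 1))).explain()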

CAUTION: This process assumes that the data being ingested is sequential in time, i.e. that data from a subsequent incremental append will have the same or later date values as the most recent date in the previous incremental append. If your input dataset is not monotonically increasing in time, this technique can lead to data loss if not carefully managed.

For example, suppose you run catchup mode for the first 4 days of data (one day of data at a time). While you're running the build for day 3, data is ingested to the input containing data for day 1. The catchup builds will not parse this data, because day 1 has already been processed and subsequent catchup builds filter out data from day 1. Additionally, any appends to the input that occur before the final successful catchup build will not be seen by the continuing phase of the transform, because they are not new data ingested since the last successful build.

If this is happening to you, you can ensure data completeness by identifying the behavior and accounting for it. Suppose that each subsequent append to your input dataset can contain data from up to 3 days back, and suppose that you want to catch up from day 1 to day 30 (today). You therefore know that no new data will be posted for days 1-27 while you run your catchup builds. If you set the END_DATE to day 27, you will have a rather large incremental build for your first continuing build, but you will not experience data loss.
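
If you would rather derive END_DATE than hard-code it, here is a minimal sketch under the assumption that late records never arrive more than MAX_LATENESS_DAYS behind the current date (a value you would have to measure for your own ingest):

import datetime

# Assumption: no record arrives more than this many days after its date value
MAX_LATENESS_DAYS = 3

# Stop the catchup phase early enough that late-arriving data cannot land inside
# the range already processed; everything after END_DATE is then picked up by
# the first 'continuing' (fully incremental) build
END_DATE = datetime.date.today() - datetime.timedelta(days=MAX_LATENESS_DAYS)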

NOTE: I chose to make the switch between the first, catchup, and continuing phases a manual process, for two reasons:

Firstly, you could consolidate the first and catchup phases into a single phase by wrapping the read of the placeholder_date dataset in a try/except (a rough sketch of this follows below), but this puts you in the position of relying on error handling for control flow, which is generally unwise.

Secondly, once the catchup phase is finished, the continuing phase abandons the placeholder_date dataset, which is no longer fit for purpose (since the continuing phase reads from transactions that may be intraday or otherwise mixed-date). It's therefore not possible to safely determine from the existing stored state whether the next build should be a catchup build or a continuing build.
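
For illustration, the consolidation mentioned in the first point might look roughly like the following inside the compute function (a hypothetical sketch, not part of the transform above; it assumes that reading the previous placeholder fails or yields no row on the very first run):

    # Hypothetical consolidation of 'first' and 'catchup' into one phase.
    # Not recommended: it relies on error handling for ordinary control flow.
    try:
        # Resume from the stored placeholder date (catchup-style run)
        placeholder_date_df = placeholder_date.dataframe('previous', placeholder_date_schema)
        most_recent_output_date = placeholder_date_df.collect()[0][0]
    except Exception:
        # First run: no usable previous placeholder, so start just before START_DATE
        most_recent_output_date = START_DATE - datetime.timedelta(days=1)
    next_output_last_date = most_recent_output_date + datetime.timedelta(days=DAYS_PER_RUN)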
