
How to force an incremental Foundry Transforms job to build non-incrementally without bumping the semantic version?

How can I force a particular dataset to build non-incrementally without changing the semantic version in the transforms repo?

Details about our specific use case:

We have about 50 datasets defined by a single incremental Python transform, using manual registration and a for-loop. The input to this transform can be anywhere from hundreds to tens of thousands of small gzip files, so when one of the larger datasets runs, it ends up partitioning all of these into only a handful of well-sized parquet files, which is perfect for our downstream jobs. However, after this job has been running incrementally for months (with files arriving every hour), there will also be a large number of small parquet files in the output. We'd like to be able to force a snapshot build of this single dataset without having to bump the semantic version of the transform, which would trigger snapshot builds for all 50 datasets. Is this possible?

I understand a potential workaround could be defining a "max output files" threshold in the transform itself, reading the current number of files in the existing output, and forcing a snapshot if the current count exceeds the maximum. However, this pipeline is time sensitive (it needs to run in under an hour), so that would introduce a level of unpredictability, since the snapshot build takes much longer. We'd like to be able to set these full snapshot builds to run about once a month, on a weekend.
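Roughly what I have in mind for that workaround is sketched below (a sketch only; the dataset paths, the threshold, and the repartition count are placeholders, and it assumes the transforms-python incremental modes):

    from transforms.api import transform, incremental, Input, Output

    MAX_OUTPUT_FILES = 200  # hypothetical threshold

    @incremental(semantic_version=1)
    @transform(
        out=Output("/Project/output_dataset"),    # placeholder path
        source=Input("/Project/raw_gzip_files"),  # placeholder path
    )
    def compute(ctx, out, source):
        # Count the files already written to the output; too many small files
        # means it is time to rewrite everything in one snapshot transaction.
        too_many_files = (
            ctx.is_incremental
            and len(list(out.filesystem(mode="previous").ls())) > MAX_OUTPUT_FILES
        )

        if too_many_files or not ctx.is_incremental:
            out.set_mode("replace")                     # snapshot: overwrite the output
            df = source.dataframe("current")            # re-read the full input
            out.write_dataframe(df.repartition(16))     # back to a handful of files
        else:
            out.set_mode("modify")                      # incremental: append only
            out.write_dataframe(source.dataframe("added"))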

Commit an empty append transaction on the output dataset. The output has then been changed by something other than the transform itself, so the incremental check fails and the next build falls back to a snapshot.
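If you want to do that from outside the platform, a rough sketch is below. The host, dataset RID, and token are placeholders, and the endpoint paths follow my reading of the public Datasets API, so treat them as an assumption and check the API docs for your stack:

    import requests

    HOST = "https://your-stack.palantirfoundry.com"  # placeholder
    DATASET_RID = "ri.foundry.main.dataset.1234"     # placeholder: the output dataset
    TOKEN = "<bearer token>"                         # placeholder

    headers = {"Authorization": f"Bearer {TOKEN}"}

    # Open an empty APPEND transaction on the output dataset's master branch...
    txn = requests.post(
        f"{HOST}/api/v2/datasets/{DATASET_RID}/transactions",
        params={"branchName": "master"},
        json={"transactionType": "APPEND"},
        headers=headers,
    )
    txn.raise_for_status()

    # ...and commit it straight away without writing any files.
    commit = requests.post(
        f"{HOST}/api/v2/datasets/{DATASET_RID}/transactions/{txn.json()['rid']}/commit",
        headers=headers,
    )
    commit.raise_for_status()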

My preferred approach these days is to use what I call a "snapshot dataset". This approach lets you inject a snapshot transaction into your pipeline at any arbitrary point, as well as schedule snapshot builds at regular intervals, which can be very useful for keeping long-lived, low-latency pipelines performant.

For this, I use a wrapper when declaring my transforms (Java transforms in my case, but it applies similarly to Python) which adds an additional input to my transform.

Let's say you start with a transform that reads datasets A and B and produces dataset C. The wrapper will insert an additional input dataset called CSnapshotDataset, as well as generate the transform that produces this (empty) dataset.

The automatically generated transform that produces CSnapshotDataset will always put an empty SNAPSHOT transaction into the dataset whenever it is built. When a new snapshot transaction arrives from CSnapshotDataset, your transform will output a snapshot transaction as well.

To then snapshot your pipeline from a given point onwards, for instance from and including dataset C, you simply select C's snapshot dataset (CSnapshotDataset in this case) and build it. The next (scheduled) run of the pipeline will snapshot C and everything downstream of it.
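Written out by hand in Python, the two pieces would look roughly like this (a sketch only; dataset paths and names are placeholders, and the real business logic is elided):

    from transforms.api import transform, incremental, Input, Output

    # Generated transform: a plain (non-incremental) transform always writes a
    # SNAPSHOT transaction, so building this dataset is the "snapshot trigger".
    @transform(
        snapshot_out=Output("/pipeline/CSnapshotDataset"),  # placeholder path
    )
    def build_c_snapshot_dataset(ctx, snapshot_out):
        snapshot_out.write_dataframe(
            ctx.spark_session.createDataFrame([], "triggered_at timestamp")
        )

    # The real transform, with the snapshot dataset wired in as an extra input.
    # A new SNAPSHOT transaction on CSnapshotDataset fails the incremental check,
    # so the next run rewrites C as a snapshot and everything downstream follows.
    @incremental(semantic_version=1)
    @transform(
        c=Output("/pipeline/C"),                         # placeholder paths
        a=Input("/pipeline/A"),
        b=Input("/pipeline/B"),
        c_snapshot=Input("/pipeline/CSnapshotDataset"),
    )
    def build_c(ctx, c, a, b, c_snapshot):
        # c_snapshot is never read; its only job is to invalidate incrementality.
        df = a.dataframe()  # placeholder for the real logic over A and B
        c.write_dataframe(df)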

To run this on a regular interval, you can then set a schedule to build CSnapshotDataset.

I apply this wrapper generously (generally to any transform I write) which gives me the flexibility to snapshot the pipeline from any dataset that might need it.

While it's a little more up-front work to set this up, the major advantages with this are:

  • It's a single click to kick off a snapshot and a few clicks to set up a scheduled snapshot, rather than having to do multiple curl calls
  • It keeps the transaction history of the input and output datasets clean
  • It happens entirely in-platform, with no need to extract a token, use a command line, Jenkins, or similar

I think you could

for the input: input = input.dataframe('current')

for the output: output.set_mode('replace')

I think you simply decide at run time whether to call TransformOutput.set_mode() on your output with 'replace' or 'modify'. This way, you could decide, based on the sizing of your inputs, whether you'd like to overwrite or append to the output.
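A rough sketch of that decision inside a transform (the paths and the cutoff are placeholders, and it assumes the transforms-python incremental modes):

    from transforms.api import transform, incremental, Input, Output

    @incremental(semantic_version=1)
    @transform(out=Output("/project/out"), source=Input("/project/in"))  # placeholder paths
    def compute(ctx, out, source):
        # Any runtime signal works; here the number of newly added input files decides.
        added_files = list(source.filesystem(mode="added").ls()) if ctx.is_incremental else []

        if not ctx.is_incremental or len(added_files) > 1000:  # hypothetical cutoff
            out.set_mode("replace")                             # overwrite (snapshot)
            out.write_dataframe(source.dataframe("current"))
        else:
            out.set_mode("modify")                              # append
            out.write_dataframe(source.dataframe("added"))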
