
How do I reduce the number of files in my foundry dataset?

My dataset has 20,000 files, each one very small. How can I reduce the number of files, and what would be an optimal number?

The most straightforward way to do this is to explicitly call repartition() (or coalesce(), if the partition count is strictly decreasing from the original number) at the end of your transformation.

This needs to be the final call before you return / write out your result.

This would look like:

from transforms.api import transform_df, Input, Output

@transform_df(
    # ... output and inputs
)
def my_compute_function(my_input):
    df = my_input
    # ... my transform logic ...

    df = df.coalesce(500)
    # df = df.repartition(500)  # this also works, but is slightly slower than coalesce
    return df

For reference, this is the precursor step to something called 'bucketing'.

The optimal number of buckets depends on the scale of data you are operating with. It is fairly straightforward to calculate the optimal number of buckets by observing the total size of your dataset on disk after a successful build.

If your dataset is 128 GB in size and you want to end up with 128 MB files, your number of buckets is:

128 GB * (1000 MB / 1 GB) * (1 file / 128 MB) -> 1000 files

NOTE: this isn't an exact calculation, since your final dataset size after changing the bucket count will differ due to the Snappy + Parquet compression used on write-out. The file sizes will be slightly different from what you anticipated, so you may end up needing 1100 or 900 files in the above example.
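As a minimal sketch of that back-of-the-envelope calculation (the helper name and the default 128 MB target are my own choices for illustration, not part of any Foundry API):

def estimate_target_file_count(dataset_size_gb, target_file_size_mb=128):
    """Rough estimate: total size on disk divided by the desired file size."""
    total_mb = dataset_size_gb * 1000  # using 1000 MB per GB, as in the example above
    return max(1, round(total_mb / target_file_size_mb))

print(estimate_target_file_count(128))  # 128 GB dataset -> 1000 files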

Since this is a problem I've had to solve quite a few times, I've decided to write out a more detailed guide covering a number of different techniques, their pros and cons, and a raison d'être.

Why reduce the file count?

There are a couple of good reasons to avoid datasets with many files:

  • Read performance can be worse. When the data is fragmented across many small files, performance for applications like contour (Analysis) can suffer badly, as the executors incur the overhead of downloading many small files from the backing filesystem.
  • If the backing filesystem is HDFS, many small files will increase heap pressure on the hadoop name-nodes and the gossip protocol. HDFS does not handle many small files well, as it does not stream/paginate the list of files in the filesystem, but instead constructs messages containing a complete enumeration of all files. When you have tens or even hundreds of millions of filesystem objects in HDFS, this ends up bumping into the name-node RPC message size limit (which you can increase in the config) and the available heap memory (which you can also increase in the config... if you have more memory available). Inter-node communication becomes slower and slower.
  • Transforms become slower, as (currently, even for incremental transforms) the driver thread has to retrieve a complete list of all files in the current view from the catalog, as well as metadata and provenance for transactions (which is only tangentially related, but many files often correlate with many transactions).
  • Transforms can OOM the driver, as the set of files and the set of transactions are kept in memory at some points in time. This can be solved by assigning a larger memory profile to the driver, but that increases cost and/or reduces the resources available for other pipelines.

Why do we end up with many files in a dataset in the first place?

Ending up with a dataset with many files is typically caused by one of these three reasons:

  • A file ingest that ingests many small files
  • An (ill-behaved) transform that produces many small files. Each time a wide operation is executed in spark, a shuffle can occur. For instance, when a groupBy is executed (which implies a shuffle), spark will by default repartition the data into 200 new partitions, which is too many for e.g. an incremental transform (see the sketch after this list). A transform can also produce too many output files due to bad partitioning (discussed below).
  • A pipeline that runs incrementally and runs frequently. Every time the pipeline runs and processes a (typically small) piece of data, a new transaction is created on each dataset, each of which contains at least one file.
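As a small standalone PySpark sketch of that default shuffle behaviour (the toy data and column names are made up, and this is plain PySpark rather than a foundry transform):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Disable AQE so the default shuffle partition count is visible.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Even a tiny input fans out into spark.sql.shuffle.partitions (200 by default)
# partitions after the shuffle implied by the groupBy.
df = spark.createDataFrame([(i % 10, i) for i in range(1000)], ["key", "value"])
grouped = df.groupBy("key").agg(F.count("value").alias("n"))
print(grouped.rdd.getNumPartitions())  # 200

# Repartitioning (or coalescing) before the write keeps the output file count sane.
print(grouped.repartition(10).rdd.getNumPartitions())  # 10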

Next, I'll list all the methods of reducing file counts in datasets that I'm aware of, along with their advantages and drawbacks, as well as some characterization of when they are applicable.

Upon ingest (magritte transformers)

One of the best options is to avoid having many files in the first place. When ingesting many files from e.g. a filesystem-like source, a magritte transformer like the "concatenating transformer" may help to combine many CSV, JSON or XML files into a single one. Concatenating and then applying the gzip transformer is a particularly effective strategy where applicable, as it often reduces the size of XML and similar text formats by around 94%.

The major limitation is that to apply this, you need to

  • have multiple files available whenever the ingest runs (so not as effective for ingests that run very frequently on frequently updating data-sources)
  • have a data source that provides you with files that can be concatenated

It's possible to zip up many files into fewer files (using a format such as .tar.bz2, .tar.gz, .zip, .rar etc.) as well, but this subsequently requires a downstream transform that is aware of this file format and manually unpacks it (an example of this is available in the documentation), as foundry is not able to transparently provide the data within these archives. There is no pre-made magritte processor that does this, however, and on the occasions that I've applied this technique I've used bash scripts to perform the packing prior to ingestion, which is admittedly less than ideal.
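As a rough, hedged sketch of what such an unpacking transform could look like (the dataset paths, the glob pattern, the assumption that the archives are .tar.gz files containing CSVs, and the use of pandas to parse them are all mine; the documented example may differ):

import tarfile

import pandas as pd
from transforms.api import transform, Input, Output

@transform(
    output=Output("/Project/datasets/unpacked"),        # hypothetical paths
    archives=Input("/Project/datasets/raw_archives"),
)
def unpack_archives(ctx, output, archives):
    frames = []
    # Iterate over the .tar.gz archives that were ingested as raw files.
    for status in archives.filesystem().ls(glob="*.tar.gz"):
        with archives.filesystem().open(status.path, "rb") as archive:
            # Streaming mode ("r|gz") avoids needing a seekable file object.
            with tarfile.open(fileobj=archive, mode="r|gz") as tar:
                for member in tar:
                    if member.isfile() and member.name.endswith(".csv"):
                        frames.append(pd.read_csv(tar.extractfile(member)))

    combined = pd.concat(frames, ignore_index=True)
    # Write a single well-partitioned parquet output instead of many raw files.
    output.write_dataframe(ctx.spark_session.createDataFrame(combined).coalesce(1))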

Background compaction

There is a new mechanism in foundry that decouples the dataset that you write to from the dataset that is read from. There is essentially a background job running that shuffles files into an optimized index as you append them, so that reads of the dataset can (mostly) go to this optimized index instead of the (usually somewhat arbitrary) data layout that the writer left behind.

This has various benefits (like automatically producing layouts of the data that are optimized for the most common read patterns), one of them being that it can "compactify" your dataset in the background.

When reading from such a dataset, your reads essentially hit the index as well as the input dataset (which contains any files that haven't been merged by the background process into the index yet.)

The big advantage is that this happens automatically in the background, and regardless of how messy your data ingestion or transform is, you can simply write out the data (taking no perf hit on write and getting the data to the consumer ASAP) while still ending up with a nicely partitioned dataset with few files (eventually.)

The major limitation here is that this only works for datasets in a format that spark can natively understand, such as parquet, avro, json, csv, ... If you have e.g. an ingest of arbitrary files, a workaround can be to pack these up into e.g. parquet before ingestion. That way foundry can still merge multiple of these parquet files over time.

This feature is not quite available to end-users yet (but is planned to be enabled by default for everything). If you think this is the most desirable solution for one of your pipelines, your palantir POC can kick off a ticket with the team to enable this feature.

repartition & coalesce

Coalescing is an operation in spark that can reduce the number of partitions without having a wide dependency (the only such operation in spark). Coalescing is fast, because it minimizes shuffling. How it works exactly has changed over previous spark versions (and there's a lot of conflicting information out there), but it's generally faster than repartition. However, it comes with a big caveat: it reduces the parallelism of your entire transform.

Even if you coalesce at the very end, right before writing your data, spark will adapt the entire query plan to use fewer partitions throughout, resulting in fewer executors being used, meaning you get less parallelism.

Repartitioning is similar, but it inserts a full shuffle stage. This comes at a higher performance cost, but it means the data that comes out of this stage is essentially guaranteed to be well-partitioned (regardless of the input). While repartition is somewhat expensive by itself, it does not suffer from the issue of reducing parallelism throughout the transform.

This means that you will typically get better overall performance using repartition over coalesce when the amount of data you end up writing out is not that massive compared to the amount of prior work you do on it, as the ability to process the data on more executors outweighs the cost of the shuffle at the end. In my experience, repartition usually wins out here unless your transforms are very simple.

One particular use case worth discussing is that of an incremental pipeline. If your incremental pipeline is relatively straightforward and only does e.g. mapping and filtering, then doing a coalesce is fine. However, many incremental pipelines also read snapshot views of very large datasets. For instance, an incremental pipeline might receive one new row of data and read the entire previous output dataset (possibly millions of rows) to see if this row already exists in the output dataset. If it already exists, no row is emitted; if it does not exist, the row is appended. Similar scenarios happen when joining a small piece of incremental data against large static datasets, etc.

In this scenario, the transform is incremental, but it still benefits from high parallelism, because it still handles large amounts of data.

My rough guideline is:

  • transform runs as snapshot: repartition to a reasonable number
  • transform runs incrementally and doesn't need high parallelism: coalesce(1)
  • transform runs incrementally but still benefits from parallelism: repartition(1)
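A minimal sketch of the third case, using the @incremental decorator (the dataset paths, column names and the snapshot_inputs usage here are assumptions for illustration, not taken from any particular pipeline):

from transforms.api import incremental, transform_df, Input, Output

@incremental(snapshot_inputs=["reference"])
@transform_df(
    Output("/Project/datasets/enriched_events"),          # hypothetical paths
    events=Input("/Project/datasets/new_events"),
    reference=Input("/Project/datasets/large_reference_table"),
)
def enrich_events(events, reference):
    # The join against the large reference table still runs with full
    # parallelism across executors...
    enriched = events.join(reference, on="key", how="left")
    # ...while repartition(1) only squeezes the (small) incremental output
    # into a single file, instead of shrinking the whole plan like coalesce would.
    return enriched.repartition(1)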

If write speed / pipeline latency is essential, neither of these options may be acceptable. In such cases, I would consider background compaction instead.

Regular snapshotting

As an extension of the previous point, to keep incremental pipelines high-performance, I like to schedule regular snapshots on them, which allows me to repartition the dataset every once in a while, performing what's basically a "compaction".

I've described a mechanism of how to set this up here: How to force an incremental Foundry Transforms job to build non-incrementally without bumping the semantic version?

I would typically schedule a snapshot for e.g. the weekend. Throughout the week, each dataset in the pipeline (which might have hundreds of datasets) will accumulate thousands or tens of thousands of transactions and files. Then over the weekend, as the scheduled snapshot rolls through the pipeline, each dataset will be repartitioned down to, say, a hundred files.

AQE

Somewhat recently, AQE (Adaptive Query Execution) became available in foundry. For the purpose of this discussion, AQE essentially injects coalesce operations into stages where a shuffle is already happening anyway, sized based on the outcome of the previous operation. This typically improves partitioning (and hence file count), but in rare circumstances it can reportedly also make it worse (though I have not observed this myself).

AQE is enabled by default, but there's a spark profile you can apply to your transform if you want to try disabling it.
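For reference, the Spark 3 settings that this behaviour corresponds to look roughly like the following (shown as plain Spark config for illustration; in foundry you would toggle it via a spark profile rather than setting these directly, and the exact profile name is not covered here):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Adaptive Query Execution as a whole.
    .config("spark.sql.adaptive.enabled", "true")
    # The piece that merges small shuffle partitions after each stage.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Rough target size AQE aims for when merging partitions.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
    .getOrCreate()
)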

Bucketing & partitioning

Bucketing and partitioning are somewhat tangential to this discussion, as they are mainly about particular ways to lay out the data to optimize for reading it. Neither of these techniques currently works with incremental pipelines.

A common mistake is to write out a dataset partitioned by a high-cardinality column, such as a timestamp. In a dataset with 10 million unique timestamps, this will result in (at least) 10 million files in the output dataset.

In these cases the transform should be fixed and the old transaction (which contains millions of files) should be deleted by applying retention.
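As a hedged sketch of the fix for that mistake (the paths, column names and the partition_cols usage are my assumptions about a typical setup, not taken from the post above):

from pyspark.sql import functions as F
from transforms.api import transform, Input, Output

@transform(
    output=Output("/Project/datasets/events_partitioned"),   # hypothetical paths
    source=Input("/Project/datasets/events_raw"),
)
def partition_by_day(output, source):
    df = source.dataframe()
    # Derive a low-cardinality column (one value per day) instead of
    # partitioning on the raw, effectively unique timestamp.
    df = df.withColumn("event_date", F.to_date("event_timestamp"))
    output.write_dataframe(
        df.repartition("event_date"),
        partition_cols=["event_date"],
    )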

Other hacks

Other hacks to compactify datasets are possible, such as creating "loop-back" transforms that read the previous output and repartition it, or to manually open transactions on the dataset to re-write it.

However, these are very hacky and in my view undesirable, and should be avoided. Background compaction nowadays solves this problem in a much more elegant, reliable and less hacky manner.
