My Dataset has 20000 files, each very small one. How would I reduce the number of files and what would be an optimal number?
The most straightforward way is to explicitly call repartition() (or coalesce(), if the partition count is strictly decreasing from the original number) at the end of your transformation. This needs to be the final call before you return / write out your result. It would look like:
```python
from transforms.api import transform_df

@transform_df(
    # ... inputs
)
def my_compute_function(my_inputs):
    # ... my transform logic, producing df ...
    df = df.coalesce(500)
    # df = df.repartition(500)  # also works, but is slightly slower than coalesce
    return df
```
For reference, this is the precursor step to something called 'bucketing'.
The optimal number of buckets depends on the scale of data you are operating with. It is fairly straightforward to calculate by observing the total size of your dataset on disk after a successful build.
If your dataset is 128 GB in size, you will want to end up with roughly 128 MB files, so your number of buckets is:
128 GB * (1000 MB / 1 GB) * (1 file / 128 MB) -> 1000 files
NOTE: this isn't an exact calculation, since your final dataset size will change after re-bucketing due to the Snappy + Parquet compression applied on write-out. The file sizes will turn out slightly different than you anticipated, so you may end up needing 1100 or 900 files in the above example.
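The rule of thumb above can be written as a small helper. This is just a sketch; the helper name and the 128 MB default are my own choices, not part of any foundry API:

```python
# Estimate a reasonable output file count from the on-disk dataset size.
# target_file_mb=128 is the common rule of thumb for parquet file sizes.
def estimate_file_count(dataset_size_gb: float, target_file_mb: int = 128) -> int:
    size_mb = dataset_size_gb * 1000  # using 1000 MB/GB, as in the example above
    return max(1, round(size_mb / target_file_mb))

print(estimate_file_count(128))   # 1000 files for a 128 GB dataset
print(estimate_file_count(0.05))  # tiny datasets still get at least 1 file
```

Remember to treat the result as a starting point and adjust after observing the actual post-compression file sizes.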
Since this is a problem I've had to solve quite a few times, I've decided to write out a more detailed guide with a bunch of different techniques, pros and cons and a raison d'être.
There are a couple of good reasons to avoid datasets with many files: every file adds fixed overhead (listing, opening, and scheduling a task per file), so reading thousands of tiny files is far slower than reading a few well-sized ones.
Ending up with a dataset with many files is typically caused by reasons such as these:
- Whenever a wide operation such as a groupBy is executed (which implies a shuffle), spark will by default repartition the data into 200 new partitions, which is too many for e.g. an incremental transform.
- A transform can also produce too many output files due to bad partitioning (discussed below).

Next, I'll list all the methods of reducing file counts in datasets that I'm aware of, along with their drawbacks and advantages, as well as some characterization of when they are applicable.
One of the best options is to avoid having many files in the first place. When ingesting many files from eg a filesystem-like source, a magritte transformer like the "concatenating transformer" may help to combine many CSV, JSON or XML files into a single one. Concatenating and then applying the gzip transformer is a particularly effective strategy when applicable, as it often reduces the size of XML and similar text formats by 94% or so.
The major limitation is that to apply this, you need to
It's possible to zip up many files into fewer archives (using a format such as .tar.bz2, .tar.gz, .zip, .rar, etc.) as well, but this then requires a downstream transform that is aware of the file format and unpacks it manually (an example of this is available in the documentation), as foundry is not able to transparently provide the data within these archives. There's no pre-made magritte processor that does this, however; on the occasions I've applied this technique, I've used bash scripts to perform the packing prior to ingestion, which is admittedly less than ideal.
There is a new mechanism in foundry that decouples the dataset that you write to from the dataset that is read from. There is essentially a background job running that shuffles files into an optimized index as you append them, so that reads of the dataset can (mostly) go to this optimized index instead of the (usually somewhat arbitrary) data layout that the writer left behind.
This has various benefits (like automatically producing layouts of the data that are optimized for the most common read patterns), one of them being that it can "compactify" your dataset in the background.
When reading from such a dataset, your reads essentially hit the index as well as the input dataset (which contains any files that haven't been merged by the background process into the index yet.)
The big advantage is that this happens automatically in the background, and regardless of how messy your data ingestion or transform is, you can simply write out the data (taking no perf hit on write and getting the data to the consumer ASAP) while still ending up with a nicely partitioned dataset with few files (eventually.)
The major limitation here is that this only works for datasets that are in a format that spark can natively understand, such as parquet, avro, json, csv, ... If you have eg an ingest of arbitrary files, a workaround can be to pack these up into eg parquet before ingestion. That way foundry can still merge multiple of these parquet files over time.
This feature is not quite available to end-users yet (but is planned to be enabled by default for everything.) If you think this is the most desirable solution for one of your pipelines, your palantir POC can kick off a ticket with the team to enable this feature.
Coalescing is an operation in spark that can reduce the number of partitions without a wide dependency (it is the only such operation in spark). Coalescing is fast because it minimizes shuffling. Exactly how it works has changed over spark versions (and there's a lot of conflicting information out there), but it's generally faster than repartition. However, it comes with a big caveat: it reduces the parallelism of your entire transform. Even if you coalesce at the very end, right before writing your data, spark will adapt the entire query plan to use fewer partitions throughout, resulting in fewer executors being used, meaning you get less parallelism.
Repartitioning is similar, but it inserts a full shuffle stage. This comes at a higher performance cost, but it means the data that comes out of this stage is essentially guaranteed to be well-partitioned (regardless of the input). While repartition is somewhat expensive by itself, it does not suffer from the issue of reducing parallelism throughout the transform. This means that you will typically get better overall performance from repartition than from coalesce whenever the amount of data you write out is small compared to the amount of prior work you do on it, as the ability to process the data on more executors outweighs the cost of the final shuffle. In my experience, repartition usually wins out here unless your transforms are very simple.
One particular use case worth discussing is that of an incremental pipeline. If your incremental pipeline is relatively straightforward and only does e.g. mapping and filtering, then a coalesce is fine. However, many incremental pipelines also read snapshot views of very large datasets. For instance, an incremental pipeline might receive one new row of data and read the entire previous output dataset (possibly millions of rows) to see if this row already exists there. If it already exists, no row is emitted; if it does not, the row is appended. Similar scenarios happen when joining a small piece of incremental data against large static datasets, etc.
In this scenario, the transform is incremental, but it still benefits from high parallelism, because it still handles large amounts of data.
My rough guideline is:
- if the transform still handles large amounts of data, repartition to a reasonable number
- if the transform writes out only a small amount of data and is simple, coalesce(1)
- or even repartition(1), if the transform itself still benefits from parallelism
If write speed / pipeline latency is essential, neither of these options may be acceptable. In such cases, I would consider background compactification instead.
As an extension of the previous point, to keep incremental pipelines high-performance, I like to schedule regular snapshots on them, which allows me to repartition the dataset every once in a while, performing what's basically a "compaction".
I've described a mechanism of how to set this up here: How to force an incremental Foundry Transforms job to build non-incrementally without bumping the semantic version?
I would typically schedule a snapshot on eg the weekend. Throughout the week, each dataset in the pipeline (which might have hundreds of datasets) will accumulate thousands or tens of thousands of transactions & files. Then over the weekend, as the scheduled snapshot rolls through the pipeline, each dataset will be repartitioned down to, say, a hundred files.
Somewhat recently, AQE became available in foundry. For the purposes of this discussion, AQE essentially injects coalesce operations into stages where a shuffle is already happening anyway, sized according to the outcome of the previous operation. This typically improves partitioning (and hence file count), but can reportedly, in rare circumstances, also make it worse (though I have not observed this myself).
AQE is enabled by default, but there's a spark profile you can apply to your transform if you want to try disabling it.
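For reference, in plain spark the profile maps onto the `spark.sql.adaptive.enabled` setting (the exact foundry profile name varies, so check your environment's profile list); disabling it looks like:

```
spark.sql.adaptive.enabled=false
```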
Bucketing and partitioning are somewhat tangential to this discussion, as they are mainly about particular ways to lay out the data to optimize for reading it. Neither of these techniques currently work with incremental pipelines.
A common mistake is to write out a dataset partitioned by a high-cardinality column, such as a timestamp. In a dataset with 10 million unique timestamps, this will result in (at least) 10 million files in the output dataset.
In these cases the transform should be fixed and the old transaction (which contains millions of files) should be deleted by applying retention.
Other hacks to compactify datasets are possible, such as creating "loop-back" transforms that read the previous output and repartition it, or to manually open transactions on the dataset to re-write it.
These are very hacky, however, and in my view undesirable; they should be avoided. Background compactification mostly solves this problem in a much more elegant and reliable manner nowadays.