
How do I reduce the number of files in my Foundry dataset?

My dataset has 20,000 files, each one very small. How would I reduce the number of files, and what would be an optimal number?

The most straightforward way to do this is to explicitly do a repartition() (or coalesce(), if the partition count is strictly decreasing from the original number) at the end of your transformation.

This needs to be the final call before you return / write out your result.

This would look like:

# ...

@transform_df(
  # ... inputs
)
def my_compute_function(my_input):
  df = my_input
  # ... my transform logic ...

  df = df.coalesce(500)
  # df = df.repartition(500)  # this also works, but is slightly slower than coalesce
  return df

For reference, this is the precursor step to something called 'bucketing'.

The optimal number of buckets depends on the scale of data you are operating with. It is fairly straightforward to calculate the optimal number of buckets by observing the total size of your dataset on disk after a successful build.

If your dataset is 128 GB in size, you will want to end up with files of roughly 128 MB each, therefore your number of buckets is:

128 GB * (1000 MB / 1 GB) * (1 file / 128MB) -> 1000 files

NOTE: this isn't an exact calculation, since your final dataset size after changing the bucket count will be different due to the Snappy + Parquet compression used on write-out. You'll notice that the file sizes are slightly different than you anticipated, so you may end up needing 1100 or 900 files in the above example.
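As a minimal sketch of that arithmetic (the 128 MB target and the helper function are illustrative assumptions, not Foundry settings), the estimate could be written as:

TARGET_FILE_SIZE_MB = 128  # assumed target size per output file

def estimate_file_count(dataset_size_gb, target_file_size_mb=TARGET_FILE_SIZE_MB):
  """Rough estimate of how many files (buckets/partitions) to aim for."""
  total_mb = dataset_size_gb * 1000  # 1 GB = 1000 MB, as in the example above
  return max(1, round(total_mb / target_file_size_mb))

# estimate_file_count(128) -> 1000, matching the worked example; expect the real
# count to drift (e.g. 900 or 1100) once compression changes the on-disk size.

The resulting number is what you would pass to repartition() or coalesce() at the end of your transform.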

Since this is a problem I've had to solve quite a few times, I've decided to write out a more detailed guide with a bunch of different techniques, their pros and cons, and a raison d'être for each.

Why reduce the file count?

There are a couple of good reasons to avoid datasets with many files:

  • Read performance can be worse. When the data is fragmented across many small files, performance for applications like Contour (Analysis) can seriously suffer, as the executors have to pay the overhead of downloading many small files from the backing filesystem.
  • If the backing filesystem is HDFS, many small files will increase heap pressure on the Hadoop name-nodes and the gossip protocol. HDFS does not handle many small files terribly well: it does not stream/paginate the list of files in the filesystem, but instead constructs messages containing a complete enumeration of all files. When you have tens or even hundreds of millions of filesystem objects in HDFS, this ends up bumping into the name-node RPC message size limit (which you can increase in the config) and the available heap memory (which you can increase in the config... if you have more memory available). Inter-node communication becomes slower and slower.
  • Transforms become slower, as (currently, even for incremental transforms) the driver thread has to retrieve a complete list of all files in the current view from the catalog, as well as metadata and provenance for transactions (which is only tangentially related, but it's not unusual for many files to be correlated with many transactions).
  • Transforms can OOM the driver, as the set of files and the set of transactions are kept in memory at some points in time. This can be solved by assigning a larger memory profile to the driver -- but this increases cost and/or decreases the resources available for other pipelines.

Why do we end up with many files in a dataset in the first place?

Ending up with a dataset with many files is typically caused by one of these three things:

  • A file ingest that ingests many small files.
  • An (ill-behaved) transform that produces many small files. Each time a wide operation is executed in Spark, a shuffle can occur. For instance, when a groupBy is executed (which implies a shuffle), Spark will by default repartition the data into 200 new partitions, which is too many for e.g. an incremental transform; see the sketch after this list. A transform can also produce too many output files due to bad partitioning (discussed below).
  • A pipeline that runs incrementally and runs frequently. Every time the pipeline runs and processes a (typically small) piece of data, a new transaction is created on each dataset, each of which contains at least one file.
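As a minimal sketch of the groupBy case above (the column names and the target of 10 partitions are hypothetical), you can explicitly reduce the partition count after the aggregation so the output does not inherit Spark's 200-partition shuffle default:

from pyspark.sql import functions as F

def aggregate_and_compact(df):
  # The groupBy implies a shuffle; left alone, Spark writes out
  # spark.sql.shuffle.partitions (default 200) partitions, i.e. ~200 small files.
  aggregated = df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

  # Explicitly bring the partition count back down before the result is written.
  return aggregated.repartition(10)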

Next, I'll list all the methods I'm aware of for reducing file counts in datasets, along with their advantages and drawbacks, and some characterization of when they are applicable.

Upon ingest (magritte transformers)

One of the best options is to avoid having many files in the first place. When ingesting many files from e.g. a filesystem-like source, a magritte transformer like the "concatenating transformer" may help to combine many CSV, JSON or XML files into a single one. Concatenating and then applying the gzip transformer is a particularly effective strategy where applicable, as it often reduces the size of XML and similar text formats by 94% or so.

The major limitation is that to apply this, you need to:

  • have multiple files available whenever the ingest runs (so this is not as effective for ingests that run very frequently on frequently updating data sources)
  • have a data source that provides you with files that can be concatenated

It's also possible to zip up many files into fewer files (using a format such as .tar.bz2, .tar.gz, .zip, .rar, etc.), but this subsequently requires a downstream transform that is aware of the file format and manually unpacks it (an example of this is available in the documentation), as Foundry is not able to transparently provide the data within these archives. There's no pre-made magritte processor that does this, however, and on the occasions I've applied this technique, I've used bash scripts to perform the packing prior to ingestion, which is admittedly less than ideal.
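For illustration, a rough sketch of such an unpacking transform is below; the dataset paths are placeholders, and it assumes the raw file access API (filesystem(), ls(), open() in binary mode) behaves as sketched -- check the documentation example for the authoritative version:

import shutil
import tarfile

from transforms.api import transform, Input, Output

@transform(
  output=Output("/examples/unpacked_files"),      # hypothetical output dataset
  archives=Input("/examples/ingested_archives"),  # hypothetical dataset of .tar.gz files
)
def unpack_archives(output, archives):
  fs_in = archives.filesystem()
  fs_out = output.filesystem()

  # Walk the .tar.gz files landed by the ingest and write each member
  # out as an individual file in the output dataset.
  for status in fs_in.ls(glob="*.tar.gz"):
    with fs_in.open(status.path, "rb") as raw:
      # "r|gz" streams the archive, so we never need to seek on the input.
      with tarfile.open(fileobj=raw, mode="r|gz") as archive:
        for member in archive:
          if member.isfile():
            with fs_out.open(member.name, "wb") as out_file:
              shutil.copyfileobj(archive.extractfile(member), out_file)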

Background compaction

There is a new mechanism in Foundry that decouples the dataset that you write to from the dataset that is read from. There is essentially a background job running that shuffles files into an optimized index as you append them, so that reads of the dataset can (mostly) go to this optimized index instead of the (usually somewhat arbitrary) data layout that the writer left behind.

This has various benefits (like automatically producing layouts of the data that are optimized for the most common read patterns), one of them being that it can "compactify" your dataset in the background.

When reading from such a dataset, your reads essentially hit the index as well as the input dataset (which contains any files that haven't been merged by the background process into the index yet).

The big advantage is that this happens automatically in the background: regardless of how messy your data ingestion or transform is, you can simply write out the data (taking no perf hit on write and getting the data to the consumer ASAP) while still (eventually) ending up with a nicely partitioned dataset with few files.

The major limitation here is that this only works for datasets in a format that Spark can natively understand, such as Parquet, Avro, JSON, CSV, ... If you have e.g. an ingest of arbitrary files, a workaround can be to pack these up into e.g. Parquet before ingestion. That way Foundry can still merge multiple of these Parquet files over time.
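As a rough illustration of that workaround (run wherever you stage the data prior to ingestion; the paths are placeholders), Spark's binaryFile source can wrap arbitrary files into Parquet rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read arbitrary files as rows of (path, modificationTime, length, content),
# then write them out as a handful of Parquet files that Spark natively
# understands, so they can keep being merged/compacted over time.
raw_files = spark.read.format("binaryFile").load("/staging/raw_files")     # placeholder path
raw_files.coalesce(10).write.mode("overwrite").parquet("/staging/packed")  # placeholder path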

This feature is not quite available to end-users yet (but is planned to be enabled by default for everything). If you think this is the most desirable solution for one of your pipelines, your Palantir POC can kick off a ticket with the team to enable this feature.

repartition & coalesce

Coalescing is an operation in Spark that can reduce the number of partitions without introducing a wide dependency (the only such operation in Spark). Coalescing is fast, because it minimizes shuffling. Exactly how it works has changed over previous Spark versions (and there's a lot of conflicting information out there), but it's generally faster than repartition. However, it comes with a big caveat: it reduces the parallelism of your entire transform.

Even if you coalesce at the very end, right before writing your data, Spark will adapt the entire query plan to use fewer partitions throughout, resulting in fewer executors being used, which means you get less parallelism.

Repartitioning is similar, but it inserts a full shuffle stage. This comes at a higher performance cost, but it means the data that comes out of this stage is essentially guaranteed to be well-partitioned (regardless of the input). While repartition is somewhat expensive by itself, it does not suffer from the issue of reducing parallelism throughout the transform.

This means that, overall, you will typically get better performance using repartition over coalesce if the amount of data you end up writing out is not that massive compared to the amount of prior work you do on it, because the ability to process the data on more executors ultimately outweighs the drawback of the shuffle. From my experience, repartition usually wins out here unless your transforms are very simple.

One particular use-case worth discussing is that of an incremental pipeline. If your incremental pipeline is relatively straightforward and only does e.g. mapping and filtering, then doing a coalesce is fine. However, many incremental pipelines also read snapshot views of very large datasets. For instance, an incremental pipeline might receive one new row of data and read the entire previous output dataset (possibly millions of rows) to see if this row already exists in the output dataset. If it already exists, no row is emitted; if it does not exist, the row is appended. Similar scenarios happen when joining a small piece of incremental data against large static datasets, etc.

In this scenario, the transform is incremental, but it still benefits from high parallelism, because it still handles large amounts of data.
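A minimal sketch of that existence check (the primary_key column and the helper function are hypothetical), which also shows why repartition(1) rather than coalesce(1) is preferable here:

def append_only_new_rows(new_rows_df, previous_output_df):
  # Keep only the incoming rows whose key does not already exist in the
  # previous output -- a wide operation over the (possibly huge) output.
  unseen = new_rows_df.join(previous_output_df, on="primary_key", how="left_anti")

  # repartition(1) writes a single file without throttling the parallelism of
  # the join above; coalesce(1) would shrink the whole plan to one partition.
  return unseen.repartition(1)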

My rough guideline is:

  • transform runs as a snapshot: repartition to a reasonable number
  • transform runs incrementally and doesn't need high parallelism: coalesce(1)
  • transform runs incrementally but still benefits from parallelism: repartition(1)

If write speed / pipeline latency is essential, none of these options may be acceptable. In such cases, I would consider background compaction instead.

Regular snapshotting

As an extension of the previous point: to keep incremental pipelines high-performance, I like to schedule regular snapshots on them, which allows me to repartition the dataset every once in a while, performing what is essentially a "compaction".

I've described a mechanism for setting this up here: How to force an incremental Foundry Transforms job to build non-incrementally without bumping the semantic version?

I would typically schedule a snapshot on e.g. the weekend. Throughout the week, each dataset in the pipeline (which might have hundreds of datasets) will accumulate thousands or tens of thousands of transactions and files. Then, over the weekend, as the scheduled snapshot rolls through the pipeline, each dataset will be repartitioned down to, say, a hundred files.

AQE

Somewhat recently, AQE (Adaptive Query Execution) became available in Foundry. AQE essentially (for the purposes of this discussion) injects coalesce operations into stages where you already have a shuffle going on anyway, depending on the outcome of the previous operation. This typically improves partitioning (and hence file count), but can reportedly, in rare circumstances, also make it worse (though I have not observed this myself).

AQE is enabled by default, but there's a Spark profile you can apply to your transform if you want to try disabling it.
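If you want to confirm what your transform is actually running with, here is a small sketch that inspects the underlying Spark 3.x settings (the Foundry profile just toggles these; the exact profile name varies, so it is not shown here):

from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()

# Whether AQE, and specifically its partition-coalescing step, is enabled.
print(spark.conf.get("spark.sql.adaptive.enabled", "<not set>"))
print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled", "<not set>"))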

Bucketing & partitioning

Bucketing and partitioning are somewhat tangential to this discussion, as they are mainly about particular ways to lay out the data to optimize for reading it. Neither of these techniques currently works with incremental pipelines.

A common mistake is to write out a dataset partitioned by a high-cardinality column, such as a timestamp. In a dataset with 10 million unique timestamps, this will result in (at least) 10 million files in the output dataset.

In these cases, the transform should be fixed and the old transaction (which contains millions of files) should be deleted by applying retention.
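A minimal sketch of such a fix (the column names and dataset paths are hypothetical, and the partition_cols argument to write_dataframe is an assumption about the transforms API available in your environment): derive a low-cardinality column such as the calendar date and partition on that instead of the raw timestamp:

from pyspark.sql import functions as F
from transforms.api import transform, Input, Output

@transform(
  output=Output("/examples/events_partitioned"),  # hypothetical output dataset
  events=Input("/examples/events_raw"),           # hypothetical input dataset
)
def repartition_by_date(output, events):
  df = events.dataframe()

  # 10 million unique timestamps would mean 10 million partitions/files;
  # a derived date column keeps the partition count bounded (roughly 365 per year).
  df = df.withColumn("event_date", F.to_date(F.col("event_timestamp")))

  output.write_dataframe(df, partition_cols=["event_date"])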

Other hacks

Other hacks to compactify datasets are possible, such as creating "loop-back" transforms that read the previous output and repartition it, or manually opening transactions on the dataset to re-write it.

These are very hacky and, in my view, undesirable, and should be avoided. Background compaction nowadays mostly solves this problem in a much more elegant, reliable and less hacky manner.
