
Repartitioning a parquet file with Dask

I want to understand a few things about partitioning a parquet file in Dask.

When I do it with a .csv file, the chunksize works as intended, producing 30 partitions based on 50 MB chunks.

When I try to apply the same logic through read_parquet, no partitions are created, and when I force it with repartition(partition_size='50mb'), it creates 109 partitions.

Can someone explain to me why parquet doesn't seem to work the same way as .csv when setting chunk sizes?
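
A minimal sketch of the setup described above, assuming the data is loaded with dask.dataframe; the file paths and the 50 MB figures are placeholders:

```python
import dask.dataframe as dd

# CSV: dask splits the file into ~50 MB byte ranges, so the partition count
# is roughly file_size / blocksize.
csv_df = dd.read_csv("data.csv", blocksize="50MB")   # hypothetical path
print(csv_df.npartitions)

# Parquet: dask creates one partition per row group (often one per file),
# so no size-based splitting happens at read time.
pq_df = dd.read_parquet("data.parquet")              # hypothetical path
print(pq_df.npartitions)

# Forcing a target size after loading; the split is based on the estimated
# in-memory (decoded) size, not the on-disk size.
pq_df = pq_df.repartition(partition_size="50MB")
print(pq_df.npartitions)
```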

In CSV, the fundamental, non-splittable chunk of data is one row, usually the bytes between one \n character and the next. This chunk of bytes is typically small. When you load data with dask, it reads from a given byte offset to the next \n so that each partition holds an exact number of whole rows. If you made the chunk size too small, you would find that some partitions contain no data.
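
To make the byte-offset mechanism concrete, here is a rough illustration of how a reader can align a chunk boundary to a row boundary; this is a simplification for illustration, not dask's actual implementation:

```python
def read_rows_in_chunk(f, start, blocksize):
    """Read whole rows from roughly `blocksize` bytes starting at `start`."""
    f.seek(start)
    if start != 0:
        # Skip the partial row at this offset; the previous chunk owns it.
        f.readline()
    end = start + blocksize
    rows = []
    while f.tell() < end:
        line = f.readline()
        if not line:       # end of file
            break
        rows.append(line)  # the chunk always ends at a \n, never mid-row
    return rows
```

Because every chunk boundary snaps to the next \n, a chunk smaller than a single row yields an empty partition, which is the behaviour described above.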

Parquet is not structured like this. Its fundamental non-splittable chunk is the "row group", and there is often just one row group per data file. This is done for efficiency: encoding and compressing a whole row group's worth of data in one block gives maximum read throughput. Furthermore, because of the encoding and compression, it is much harder for dask to guess how big a piece of the dataset will be as an in-memory pandas dataframe, but it can be many times bigger than the on-disk size.

A row group could easily be >>100MB in size. In fact, this is generally recommended, because smaller pieces spend a higher fraction of their processing time on overhead and latency.
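
One way to see this structure directly is to inspect a file's footer metadata; the sketch below assumes pyarrow is available and uses a placeholder path:

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("data.parquet").metadata   # hypothetical path
print("row groups:", md.num_row_groups)

for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # Sizes come from the parquet footer; the decoded pandas dataframe
    # for a row group is typically larger still.
    print(f"row group {i}: {rg.num_rows} rows, "
          f"{rg.total_byte_size / 1e6:.1f} MB of column data")
```

If a file reports a single large row group, dask will load it as a single partition regardless of any requested chunk size.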

To summarize

  • dask will not split a parquet dataset beyond the partitioning (row groups) within the data files
  • a partition may be many times larger in memory than on disk, so repartitioning after load may result in many partitions (see the sketch after this list)
  • these are trade-offs required to make parquet as fast and space-efficient as it is
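
If the goal is smaller partitions on the next read, one option consistent with the points above (and assuming the pyarrow engine) is to repartition by in-memory size before writing, so that each output file holds a smaller row group; paths here are placeholders:

```python
import dask.dataframe as dd

df = dd.read_parquet("big_input.parquet")             # hypothetical path

# Split by estimated in-memory size, then write one file per partition;
# a later read_parquet will see one (smaller) partition per file.
df = df.repartition(partition_size="50MB")
df.to_parquet("repartitioned/", engine="pyarrow")     # hypothetical output dir
```

Note that, as explained above, many small row groups trade some read throughput for the finer partitioning.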
