Repartitioning a parquet file with Dask
I want to understand a few things about partitioning a parquet file in Dask.

When I read a .csv file, the chunksize works as intended, creating 30 partitions based on 50 MB chunks.

When I try the same logic through read_parquet, no partitions are created, and when I force it with repartition(partition_size='50MB'), it creates 109 partitions.

Can someone explain to me why parquet doesn't seem to work the same way as .csv when setting chunk sizes?
In CSV, the fundamental, non-splittable chunk of data is one row, usually the bytes between one \n character and the subsequent one. This byte chunk is typically small. When you load data with dask, it reads from a given offset to the next \n to be able to read an exact number of rows. You would find, if you made the chunk size too small, that some partitions would contain no data.
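The row-aligned splitting described above can be sketched in plain Python (a simplified illustration of the idea, not dask's actual implementation):

```python
def split_csv_bytes(data: bytes, blocksize: int):
    """Split raw CSV bytes into chunks that start and end on newline boundaries."""
    chunks = []
    start = 0
    while start < len(data):
        end = start + blocksize
        if end >= len(data):
            chunks.append(data[start:])
            break
        # Extend the chunk to the next newline so no row is cut in half
        nl = data.find(b"\n", end)
        if nl == -1:
            chunks.append(data[start:])
            break
        chunks.append(data[start:nl + 1])
        start = nl + 1
    return chunks

csv = b"a,b\n" + b"".join(b"%d,%d\n" % (i, i * 2) for i in range(10))
chunks = split_csv_bytes(csv, blocksize=16)
# Every chunk ends on a row boundary, and concatenating them restores the file
assert all(c.endswith(b"\n") for c in chunks)
assert b"".join(chunks) == csv
```

Because each chunk is only ever extended to the *next* newline, a blocksize smaller than one row yields chunks that each hold a single row, which is the degenerate case the answer alludes to.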
Parquet is not structured like this. Its fundamental non-splittable chunk is the "row-group", and there is often just one row group per data file. This is done for efficiency: encoding and compressing a whole row group's worth of data in one block gives maximum read throughput. Furthermore, because of the encoding and compression, it is much harder for dask to guess how big a piece of a dataset will be as an in-memory pandas dataframe; it can be many times bigger than on disk.
A row group could easily be >>100MB in size. In fact, this is generally recommended, as smaller pieces spend a higher fraction of their processing time in overhead and latency.
To summarize