
Spark - Number of buckets for a partitioned table

Given that Hive's and Spark's bucketing are quite different (they use different bucketing algorithms, and the resulting files differ: in Hive the number of buckets equals the number of files, but in Spark that is not the case, etc.), and most of the available guidelines cover Hive rather than Spark, how can one decide the right number of buckets for a table that is going to be processed by Spark? A few questions:

  • Some guidelines state that the number of buckets should be calculated from the size of the table. However, if the table is transactional in nature and its size is expected to grow over time, how should the number of buckets be chosen?
  • One can control the number of output files (in HDFS or S3) via coalesce or repartition, and Spark's performance is highly sensitive to this (a few large files, at least bigger than the block size, are much better than many small files). So if I output, say, 10 files per partition but define the number of buckets as a higher value, say 100, is that the right approach, and will it still give the benefits expected from bucketing? (See the sketch after this list.)
  • It is also suggested that spark.sql.shuffle.partitions should be set equal to the number of buckets. Is that true?

A consolidated view on choosing the number of buckets would be quite helpful.
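
For reference, here is a minimal sketch of the setup the questions above describe: writing a partitioned, bucketed table from Spark while controlling the number of write tasks and aligning spark.sql.shuffle.partitions with the bucket count. The table and column names (raw_events, event_date, customer_id, events_bucketed), the bucket count of 100, and the 10 write tasks are hypothetical values taken from the question, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object BucketedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-write-sketch")
      // Question 3: set shuffle partitions equal to the bucket count (100 here).
      .config("spark.sql.shuffle.partitions", "100")
      .getOrCreate()

    // Hypothetical source table used only for illustration.
    val events = spark.table("raw_events")

    events
      // Question 2: limit the number of write tasks (and thus output files).
      .repartition(10)
      .write
      .partitionBy("event_date")      // partition column (assumed)
      .bucketBy(100, "customer_id")   // bucket count and column (assumed)
      .sortBy("customer_id")
      .format("parquet")
      .mode("overwrite")
      .saveAsTable("events_bucketed") // bucketBy requires saveAsTable
  }
}
```

Note that, unlike Hive, Spark's bucketed write emits one file per bucket per write task, so 10 tasks with 100 buckets can produce up to 1,000 files per table partition; repartitioning by the bucketing column before the write keeps the file count closer to one file per bucket.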

