Spark - Number of buckets for a partitioned table
Given that Hive's and Spark's bucketing are quite different (they use different bucketing algorithms, and the resulting files differ: in Hive the number of buckets equals the number of files, but in Spark it does not), and that most of the available guidelines cover Hive rather than Spark, how can one decide the right number of buckets for a table that will be processed by Spark? A few questions:
A consolidated view on choosing the number of buckets would be quite helpful.
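For concreteness, here is a minimal sketch of writing a bucketed table with Spark's `bucketBy` API; the table name, input path, column, and bucket count below are all hypothetical choices, not from the question. It illustrates the file-count difference mentioned above: Spark writes up to one file per bucket *per writing task*, so the total file count can approach (number of tasks) × (number of buckets), whereas in Hive each bucket is exactly one file.

```scala
import org.apache.spark.sql.SparkSession

object BucketingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketing-sketch")
      .enableHiveSupport() // bucketBy requires saveAsTable, i.e. a table in a catalog
      .getOrCreate()

    // Hypothetical input; any DataFrame works here.
    val df = spark.read.parquet("/data/user_events")

    // Each writing task emits up to one file per bucket it holds data for,
    // so this can produce far more than 16 files overall.
    df.write
      .bucketBy(16, "user_id") // 16 buckets is an illustrative choice
      .sortBy("user_id")
      .mode("overwrite")
      .saveAsTable("user_events_bucketed")

    spark.stop()
  }
}
```

One common workaround (beyond what the question states) is to `repartition` the DataFrame by the bucket column with the same number of partitions as buckets before writing; each bucket's data then lands in a single task, yielding roughly one file per bucket, as in Hive.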