
How many partitions Spark creates when loading a Hive table

Whether it is a Hive table or an HDFS file, when Spark reads the data and creates a DataFrame, I was under the impression that the number of partitions in the RDD/DataFrame would equal the number of part-files in HDFS. But when I ran a test with a Hive external table, the number came out different from the number of part-files: the DataFrame had 119 partitions, while the table was a Hive partitioned table containing 150 part-files, the smallest file being 30 MB and the largest 118 MB. So what decides the number of partitions?
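For reference, a minimal sketch of how the partition count can be checked (the table name my_db.my_external_table is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("check-partitions").getOrCreate()

// Read the Hive table and print how many partitions the resulting DataFrame has.
val df = spark.table("my_db.my_external_table") // hypothetical table name
println(df.rdd.getNumPartitions)                // printed 119 in the scenario above
```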

You can control how many bytes Spark packs into a single partition by setting spark.sql.files.maxPartitionBytes. The default value is 128 MB; see Spark Tuning.
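A minimal sketch of setting this option when building the session (the table name is again hypothetical); halving the limit should roughly double the partition count of the scan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("max-partition-bytes-demo")
  // Lower the packing limit from the 128 MB default to 64 MB.
  .config("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
  .getOrCreate()

val df = spark.table("my_db.my_external_table") // hypothetical table name
println(df.rdd.getNumPartitions) // expect roughly twice the partitions of the default
```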

I think this link answers my question: the number of partitions depends on the number of input splits, and the splits depend on the Hadoop InputFormat. https://intellipaat.com/community/7671/how-does-spark-partition-ing-work-on-files-in-hdfs
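For DataFrame reads through Spark SQL's file sources, the split size is computed along these lines (a simplified sketch modeled on FilePartition.maxSplitBytes in the Spark source; the defaults shown are assumptions and the exact behavior varies by Spark version):

```scala
// Simplified model of Spark SQL's file-split sizing.
def maxSplitBytes(totalBytes: Long,
                  numFiles: Long,
                  maxPartitionBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes
                  openCostInBytes: Long = 4L * 1024 * 1024,     // spark.sql.files.openCostInBytes
                  defaultParallelism: Long = 8L): Long = {
  // Each file is charged an "open cost" so that many small files spread across splits.
  val bytesPerCore = (totalBytes + numFiles * openCostInBytes) / defaultParallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}

// With ~14.8 GB spread over 150 files, the 128 MB cap wins, giving ~119 splits.
println(maxSplitBytes(totalBytes = 15891378995L, numFiles = 150L) / (1024 * 1024)) // 128
```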

Spark will read the data with a block size of 128 MB per block. Say your Hive table is approximately 14.8 GB in size; Spark will then divide the table data into 128 MB blocks, and 14.8 GB / 128 MB ≈ 118.4, which rounds up to 119 partitions.
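A quick back-of-the-envelope check of that figure:

```scala
// ~14.8 GB of table data divided into 128 MB chunks
// (128 MB is the default spark.sql.files.maxPartitionBytes).
val tableBytes = 14.8 * 1024 * 1024 * 1024 // ~14.8 GB
val splitBytes = 128.0 * 1024 * 1024       // 128 MB
println(math.ceil(tableBytes / splitBytes).toLong) // 119
```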

On the other hand, your Hive table is partitioned, so the partition column has 150 unique values.

So the number of part-files in Hive and the number of partitions in Spark are not linked.
