
How does repartitioning on a column in pyspark affect the number of partitions?

I have a dataframe with a million records. It looks like this:

df.show()

+--------------------+--------------------+-----------+
|            feature1|            feature2| domain    |
+--------------------+--------------------+-----------+
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain1   | 
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain2   |
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain1   |
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain2   |
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain1   |

The ideal partition size in Spark is 128 MB, and suppose the domain column has two unique values (domain1 and domain2). With that in mind, I have two questions:

  1. If I do df.repartition("domain") and one partition is not able to accommodate all the data for a particular domain key, will the application fail, or will it automatically create as many partitions as the data requires?

  2. Suppose the data above has already been repartitioned on the domain key, so there are two partitions (the unique keys being domain1 and domain2). Now say domain1 and domain2 are each repeated 1000000 times and I do a self-join on domain, so for each domain I will get approximately 10^12 records. Considering that we have two partitions and the number of partitions does not change during the join, will the two new partitions be able to handle 1000000 records? (A minimal PySpark sketch of this setup follows these questions.)
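To make the scenario concrete, here is a minimal sketch of the repartition and self-join described above. It assumes an active SparkSession named spark and the DataFrame df shown earlier; the extra checks are only there to inspect how the rows end up distributed, they are not part of the question itself.

from pyspark.sql import functions as F

# Repartition by the domain column (question 1).
by_domain = df.repartition("domain")

# Total number of partitions the shuffle produced (including empty ones).
print(by_domain.rdd.getNumPartitions())

# Rows per physical partition, to see where domain1 and domain2 landed.
by_domain.withColumn("pid", F.spark_partition_id()) \
         .groupBy("pid").count() \
         .show()

# Self-join on domain (question 2); each key with n rows yields n * n output rows.
joined = by_domain.alias("a").join(by_domain.alias("b"), on="domain")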

The answer depends on the size of your data. When one partition is not able to hold all the data belonging to one partition value (e.g. domain1), more partitions will be created, at most spark.sql.shuffle.partitions of them. If your data is too large, i.e. one partition would exceed the 2 GB limit (see also Why does Spark RDD partition has 2GB limit for HDFS? for an explanation), the repartitioning will cause an OutOfMemoryError.
Just as a side note to give a complete answer: being able to fit the data into one partition does not necessarily mean that only one partition is generated for a partition value. This depends, among other things, on the number of executors and on how the data was partitioned before. Spark tries to avoid unnecessary shuffling and may therefore generate several partitions for one partition value.
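As a rough way to check this against your own data, here is a sketch (assuming the SparkSession is named spark and df is the DataFrame above): look at the shuffle-partition ceiling and at how many rows each domain key carries, then estimate bytes per key against the 2 GB limit.

# Ceiling on the number of partitions a shuffle, and hence repartition("domain"), can create.
print(spark.conf.get("spark.sql.shuffle.partitions"))   # "200" by default

# Rows per key; multiplying by an average row size gives a rough bytes-per-key estimate.
df.groupBy("domain").count().show()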

Thus, to prevent the job from failing, you should adjust spark.sql.shuffle.partitions or pass the desired number of partitions to repartition together with the partition column.
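A sketch of both remedies (the value 400 is only illustrative; choose it based on your data volume):

# Remedy 1: raise the shuffle-partition ceiling before repartitioning / joining.
spark.conf.set("spark.sql.shuffle.partitions", 400)

# Remedy 2: request an explicit number of partitions together with the partition column.
df_tuned = df.repartition(400, "domain")
print(df_tuned.rdd.getNumPartitions())   # 400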
