
Spark shuffle partitions with coalesce

Original 2020-10-09 14:45:58 2 2 apache-spark

Let's say I have a dataset with 20 partitions after reading some data. I then run an aggregate operation on that dataset, which brings the number of partitions to 200 (the default number of shuffle partitions). Without calling any action on that dataset so far, I apply coalesce with 30 partitions to the same dataset and then call a Spark action on it.

So my question is: how many partitions will be used when the action runs the aggregate operation on that dataset? Will it be only 30 partitions (the number given to coalesce), or the 200 shuffle partitions?

Edit, to clarify my question: I understand that coalesce by itself does not shuffle unless the number of partitions is changed drastically. I also understand that the final dataset will have only numPartitions partitions. My question is: if I change the number of partitions before calling any action on that DataFrame, will the resulting action operate on the final number of partitions I gave (30 in my case), or will it also honor the intermediate number of partitions from the aggregate operation? In short, I mainly want to know whether the aggregation is done with 200 partitions and coalesce is applied afterwards, or whether the aggregation is also performed with only 30 partitions (in my case).

2 answers

Coalesce

Returns a new SparkDataFrame that has exactly numPartitions partitions. This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.

However, if you're doing a drastic coalesce on a SparkDataFrame, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

https://spark.apache.org/docs/2.2.1/api/R/coalesce.html
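The "each new partition claims some of the current partitions" behaviour quoted above can be illustrated with a small pure-Python model. This is only a sketch, not Spark's implementation: the function name `coalesce_groups` is hypothetical, and it assumes parent partitions are grouped into contiguous, evenly sized ranges with no regard for data locality.

```python
def coalesce_groups(num_parents, num_targets):
    """Simplified model: which parent partitions each coalesced partition claims."""
    if num_targets <= 0:
        raise ValueError("num_targets must be positive")
    # Requesting more partitions than exist leaves the count unchanged.
    n = min(num_parents, num_targets)
    groups = []
    for i in range(n):
        # Contiguous range of parent indices owned by new partition i.
        start = (i * num_parents) // n
        end = ((i + 1) * num_parents) // n
        groups.append(list(range(start, end)))
    return groups


groups = coalesce_groups(1000, 100)
print(len(groups), len(groups[0]))                            # 100 10
print(len(coalesce_groups(20, 30)))                           # 20
print(sorted({len(g) for g in coalesce_groups(200, 30)}))     # [6, 7]
```

No data is redistributed by key in this model: every parent partition is read whole by exactly one new partition, which is what makes the dependency narrow. For the case in the question, each of the 30 coalesced partitions would read 6 or 7 of the 200 shuffle partitions.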

Coalesce: moves data into a subset of the existing partitions instead of doing a full shuffle.

https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4#.36o8a7b5j

Yes, your final action will operate on the partitions produced by coalesce, which is 30 in your case. There are two types of transformations, narrow and wide. Narrow transformations do not shuffle data, while wide transformations shuffle data between nodes and start a new stage. The aggregation is the wide transformation here: it shuffles the data using the configured number of shuffle partitions (200 by default). Coalesce to a smaller number of partitions is a narrow transformation, so it does not create a stage of its own; it is applied within the stage that follows the aggregation's shuffle, and that stage runs on the 30 coalesced partitions. So yes, your action will work on 30 partitions.

https://www.google.com/amp/s/data-flair.training/blogs/spark-rdd-operations-transformations-actions/amp/
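To make the stage reasoning concrete, here is a toy pure-Python model of how task counts per stage could be derived for the pipeline in the question. It is a sketch under stated assumptions, not Spark's scheduler: `plan_stages` and its tuple-based operation list are hypothetical, and the model assumes adaptive query execution is disabled, since that feature can change the number of post-shuffle partitions at runtime.

```python
def plan_stages(ops):
    """Return one (task_count, shuffle_buckets_written) tuple per stage.

    ops is a list of (kind, n) tuples in a hypothetical mini-notation:
      ("read", n)      source with n partitions
      ("shuffle", n)   wide operation, e.g. an aggregation, with n shuffle partitions
      ("coalesce", n)  narrow reduction to at most n partitions
    """
    stages = []
    current = None  # partition count at the end of the stage being built
    for kind, n in ops:
        if kind == "read":
            current = n
        elif kind == "shuffle":
            # A shuffle closes the current stage; its tasks write n buckets.
            stages.append((current, n))
            current = n
        elif kind == "coalesce":
            # Narrow: stays in the same stage and only lowers its task count.
            current = min(current, n)
        else:
            raise ValueError(f"unknown op: {kind}")
    stages.append((current, None))  # the final stage feeds the action
    return stages


print(plan_stages([("read", 20), ("shuffle", 200), ("coalesce", 30)]))
# [(20, 200), (30, None)]
print(plan_stages([("read", 20), ("shuffle", 200)]))
# [(20, 200), (200, None)]
```

In this model the first stage runs 20 tasks that write 200 shuffle buckets, and the second stage finishes the aggregation with 30 tasks rather than 200. In a real session, comparing `df.rdd.getNumPartitions()` before and after the coalesce, together with the output of `df.explain()`, is one way to check how a given Spark version behaves.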


