
Spark shuffle partitions with coalesce

Let's say I read some data into a dataset that has 20 partitions. I then run an aggregate operation on that dataset, which raises the number of partitions to 200 (the default value of spark.sql.shuffle.partitions). Without calling any action on the dataset so far, I apply coalesce with 30 partitions to it and only then call a Spark action on it.

So my question is: how many partitions will the aggregation run with once the action executes? Will it be only 30 partitions (the number given to coalesce), or the 200 shuffle partitions?

Edit, to clarify my question: I understand that coalesce by itself will not shuffle unless we drastically change the number of partitions. I also understand that the final dataset will have exactly numPartitions partitions. My question is: if I change the number of partitions before calling any action on that DataFrame, will the resulting action operate on the final number of partitions (30 in my case), or will it also honor the intermediate partition count introduced by the aggregate operation? In short, I want to know whether the aggregation is done with 200 partitions and coalesce is applied afterwards, or whether the aggregation itself is also performed with only 30 partitions.

Coalesce

Returns a new SparkDataFrame that has exactly numPartitions partitions. This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.

However, if you're doing a drastic coalesce on a SparkDataFrame, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
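To make the "each new partition claims several current partitions" idea concrete, here is a small pure-Python sketch. It is only a simplified model, not Spark code: the function name coalesce_groups is hypothetical, and it assumes parent partitions are handed out in contiguous, near-even ranges with no regard for data locality, which the real coalescer may also take into account.

```python
from typing import List


def coalesce_groups(num_parents: int, num_target: int) -> List[List[int]]:
    """Model which parent partitions each coalesced partition would read.

    Simplified assumption: contiguous, near-even ranges, locality ignored.
    """
    if num_parents < 0 or num_target < 1:
        raise ValueError("num_parents must be >= 0 and num_target >= 1")
    if num_target >= num_parents:
        # Asking for more partitions than exist leaves the partitioning as it is.
        return [[i] for i in range(num_parents)]
    groups = []
    for i in range(num_target):
        # Integer arithmetic spreads the remainder evenly across the groups.
        start = i * num_parents // num_target
        end = (i + 1) * num_parents // num_target
        groups.append(list(range(start, end)))
    return groups
```

Under this model, coalesce_groups(1000, 100) gives 100 groups of 10 parents each, matching the example in the documentation, and coalesce_groups(200, 30) gives 30 groups of 6 or 7 parents each. No data moves between groups, which is why no shuffle is needed.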

Coalesce: merges the existing partitions into a smaller number of partitions, without a shuffle.

Yes, your final action will run on the partitions produced by coalesce, which is 30 in your case. As we know, there are two types of transformation: narrow and wide. Narrow transformations do not shuffle data, while wide transformations shuffle data between nodes and start a new stage. The aggregation is the wide transformation here: it shuffles the data into 200 partitions (the default spark.sql.shuffle.partitions). Coalesce is a narrow transformation, so it does not create a new stage; it is folded into the same stage that reads the shuffled data and finishes the aggregation. That stage therefore runs with 30 tasks, each of which processes several of the 200 shuffle partitions. So yes, your action is going to work on 30 partitions.

https://www.google.com/amp/s/data-flair.training/blogs/spark-rdd-operations-transformations-actions/amp/
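The stage reasoning above can be sketched as a toy model in plain Python. This is not Spark's scheduler: stage_task_counts and the (kind, partitions) plan format are hypothetical names made up for illustration, and the model assumes adaptive query execution is not changing the shuffle partition count. The rule it encodes is that a shuffle closes the current stage, narrow operations stay inside it, and a stage runs one task per partition of its last operation.

```python
from typing import List, Tuple


def stage_task_counts(plan: List[Tuple[str, int]]) -> List[int]:
    """Return the number of tasks per stage for a linear plan.

    plan is an ordered list of (kind, num_partitions) with kind in
    {"read", "shuffle", "coalesce"}. Toy model, not Spark's scheduler.
    """
    stages: List[int] = []
    current = None  # partition count at the end of the stage being built
    for kind, n in plan:
        if kind == "read":
            current = n
        elif kind == "shuffle":
            # A wide dependency ends the current stage and starts a new one.
            if current is not None:
                stages.append(current)
            current = n
        elif kind == "coalesce":
            # Narrow dependency: stays in the stage, never adds partitions.
            current = n if current is None else min(current, n)
        else:
            raise ValueError(f"unknown operation kind: {kind}")
    if current is not None:
        stages.append(current)
    return stages


# Read 20 partitions, aggregate (shuffle to 200), then coalesce to 30.
with_coalesce = stage_task_counts([("read", 20), ("shuffle", 200), ("coalesce", 30)])
# Same pipeline, but with repartition modeled as a second shuffle to 30.
with_repartition = stage_task_counts([("read", 20), ("shuffle", 200), ("shuffle", 30)])
```

In this model the coalesce pipeline has two stages with 20 and 30 tasks, while the repartition pipeline has three stages with 20, 200 and 30 tasks, so the aggregation keeps its full parallelism at the cost of an extra shuffle. On a real job, df.rdd.getNumPartitions() and df.explain() are the usual ways to check the final partition count and the physical plan.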


 