
Does Spark handle data shuffling?

I have an input A which I convert into an rdd X spread across the cluster.

I perform certain operations on it.

Then I call .repartition(1) on the output rdd.

Will my output rdd be in the same order as input A?

Does Spark handle this automatically? If yes, then how?
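For context, the underlying issue can be demonstrated outside Spark: once positional order is lost by a shuffle, it is only recoverable if each element was tagged with its original index beforehand (which is what Spark's RDD.zipWithIndex provides). A minimal plain-Python sketch of that idea:

```python
import random

data = ["a", "b", "c", "d", "e"]

# Tag each element with its original position, as RDD.zipWithIndex would.
indexed = list(enumerate(data))

# Simulate an order-destroying shuffle such as repartition().
random.shuffle(indexed)

# Sorting by the saved index restores the input order.
restored = [value for _, value in sorted(indexed)]
print(restored)  # ['a', 'b', 'c', 'd', 'e']
```

In Spark the same pattern would be zipWithIndex before the shuffle and a sortBy on the index afterwards.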

The documentation doesn't guarantee that order will be kept, so you should assume it won't be. If you look at the implementation, you'll see it certainly won't be (unless your original RDD already has 1 partition for some reason): repartition calls coalesce(shuffle = true), which

Distributes elements evenly across output partitions, starting from a random partition.
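That behaviour can be modelled in a few lines. The sketch below is a simplified stand-in for coalesce(shuffle = true), not Spark's actual code: each input partition deals its elements round-robin over the output partitions, starting from a randomly chosen one, so concatenating the output partitions generally does not reproduce the input order.

```python
import random

def coalesce_shuffle(elements, num_input_parts, num_output_parts):
    """Toy model of coalesce(shuffle = true): every input partition
    deals its elements round-robin over the output partitions,
    beginning at a randomly chosen output partition."""
    chunk = -(-len(elements) // num_input_parts)  # ceiling division
    input_parts = [elements[i:i + chunk] for i in range(0, len(elements), chunk)]
    output_parts = [[] for _ in range(num_output_parts)]
    for part in input_parts:
        start = random.randrange(num_output_parts)  # random starting partition
        for offset, elem in enumerate(part):
            output_parts[(start + offset) % num_output_parts].append(elem)
    return output_parts

parts = coalesce_shuffle(list(range(10)), num_input_parts=2, num_output_parts=3)
flattened = [x for p in parts for x in p]
# Every element survives the shuffle, but the concatenated order
# is usually not 0..9 because each deal starts at a random partition.
print(sorted(flattened) == list(range(10)))  # True
```

Because the starting partition is random per input partition, even a subsequent repartition(1) sees the elements in an order unrelated to the original input.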
