
Does Spark handle data shuffling?

I have an input A which I convert into an rdd X spread across the cluster.

I perform certain operations on it.

Then I call .repartition(1) on the output rdd.

Will my output rdd be in the same order as input A?

Does Spark handle this automatically? If yes, then how?
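For context, the underlying issue can be demonstrated outside Spark: once positional order is lost by a shuffle, it is only recoverable if each element was tagged with its original index beforehand (which is what Spark's RDD.zipWithIndex provides). A minimal plain-Python sketch of that idea:

```python
import random

data = ["a", "b", "c", "d", "e"]

# Tag each element with its original position, as RDD.zipWithIndex would.
indexed = list(enumerate(data))

# Simulate an order-destroying shuffle such as repartition().
random.shuffle(indexed)

# Sorting by the saved index restores the input order.
restored = [value for _, value in sorted(indexed)]
print(restored)  # ['a', 'b', 'c', 'd', 'e']
```

In Spark the same pattern would be zipWithIndex before the shuffle and a sortBy on the index afterwards.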

The documentation doesn't guarantee that order will be kept, so you should assume it won't be. If you look at the implementation, you'll see it certainly won't be (unless your original RDD already has 1 partition for some reason): repartition calls coalesce(shuffle = true), which

Distributes elements evenly across output partitions, starting from a random partition.
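That behaviour can be modelled in a few lines. The sketch below is a simplified stand-in for coalesce(shuffle = true), not Spark's actual code: each input partition deals its elements round-robin over the output partitions, starting from a randomly chosen one, so concatenating the output partitions generally does not reproduce the input order.

```python
import random

def coalesce_shuffle(elements, num_input_parts, num_output_parts):
    """Toy model of coalesce(shuffle = true): every input partition
    deals its elements round-robin over the output partitions,
    beginning at a randomly chosen output partition."""
    chunk = -(-len(elements) // num_input_parts)  # ceiling division
    input_parts = [elements[i:i + chunk] for i in range(0, len(elements), chunk)]
    output_parts = [[] for _ in range(num_output_parts)]
    for part in input_parts:
        start = random.randrange(num_output_parts)  # random starting partition
        for offset, elem in enumerate(part):
            output_parts[(start + offset) % num_output_parts].append(elem)
    return output_parts

parts = coalesce_shuffle(list(range(10)), num_input_parts=2, num_output_parts=3)
flattened = [x for p in parts for x in p]
# Every element survives the shuffle, but the concatenated order
# is usually not 0..9 because each deal starts at a random partition.
print(sorted(flattened) == list(range(10)))  # True
```

Because the starting partition is random per input partition, even a subsequent repartition(1) sees the elements in an order unrelated to the original input.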
