Is there a way to shuffle a collection in Spark?
I need to shuffle a text file with 2.2 * 10^9 lines. Is there a way to load it into Spark, shuffle each partition in parallel (for my purposes it is enough to shuffle within the scope of a partition), and then spill it back to a file?
To shuffle only within partitions you can do something like this:
rdd.mapPartitions(iter => new scala.util.Random().shuffle(iter.toSeq).iterator)
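The idea can be sketched without a Spark cluster. Below is a minimal, hypothetical stand-in where each inner `Seq` plays the role of one RDD partition; `Random.shuffle` reorders each "partition" independently, which is exactly what the `mapPartitions` call does per task:

```scala
import scala.util.Random

// Hypothetical stand-in for an RDD: each inner Seq models one partition.
val partitions = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8))

// Fixed seed so the sketch is reproducible; in Spark each task would
// typically create its own Random instance.
val rng = new Random(42)

// Shuffle each "partition" independently, mirroring mapPartitions.
val shuffled = partitions.map(p => rng.shuffle(p))

// Every partition still holds exactly the same elements, only reordered.
println(shuffled)
```

Note that `Random.shuffle` needs a materialized collection, which is why the Spark snippet above converts the partition iterator with `iter.toSeq` first; this means each partition must fit in the memory of a single executor.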
To shuffle a whole RDD:
import org.apache.spark.HashPartitioner

rdd.mapPartitions(iter => {
  val rng = new scala.util.Random()
  iter.map(x => (rng.nextInt, x))
}).partitionBy(new HashPartitioner(rdd.partitions.size)).values
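The random-key trick above can also be modeled locally. In this hypothetical sketch, every record is paired with a random key and then routed to a "partition" by hashing that key, which is the same routing a `HashPartitioner` performs (a non-negative modulo of `key.hashCode`):

```scala
import scala.util.Random

// Hypothetical local model of the random-key shuffle.
val records = (1 to 100).toSeq
val numPartitions = 4
val rng = new Random(7) // fixed seed for a reproducible sketch

// Pair each record with a random key, as the mapPartitions step does.
val keyed = records.map(r => (rng.nextInt(), r))

// Route each pair by a non-negative hash of its key, as HashPartitioner does.
val routed = keyed.groupBy { case (k, _) =>
  ((k.hashCode % numPartitions) + numPartitions) % numPartitions
}

// Drop the keys, as .values does after partitionBy.
val values = routed.map { case (pid, pairs) => (pid, pairs.map(_._2)) }

println(values.map { case (pid, vs) => (pid, vs.size) })
```

Because the keys are random, records end up spread across partitions in random order; no partition gains or loses records overall, they are just redistributed.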