
Is there a way to shuffle a collection in Spark?

I need to shuffle a text file with 2.2 * 10^9 lines. Is there a way to load it into Spark, shuffle each partition in parallel (shuffling within the scope of a single partition is enough for me), and then spill it back out to a file?

To shuffle only within partitions, you can do something like this:

// shuffle needs a materialized collection, hence iter.toSeq
rdd.mapPartitions(iter => new scala.util.Random().shuffle(iter.toSeq).iterator)
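The question also asks about loading the file and spilling the result back out. Here is a minimal end-to-end sketch of that pipeline; the HDFS paths and the app name are placeholders, not anything from the original question:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-lines"))
// hypothetical input/output paths
val lines = sc.textFile("hdfs:///data/input.txt")
val shuffled = lines.mapPartitions(iter =>
  new scala.util.Random().shuffle(iter.toSeq).iterator)
shuffled.saveAsTextFile("hdfs:///data/output")

Note that shuffle materializes each partition in memory (iter.toSeq), so with 2.2 * 10^9 lines you want enough partitions that each one fits comfortably in an executor's heap.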

To shuffle a whole RDD:

import org.apache.spark.HashPartitioner

rdd.mapPartitions(iter => {
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))   // key every record with a random Int
}).partitionBy(new HashPartitioner(rdd.partitions.size)).values
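One caveat with the hash-partitioning approach: the random keys scatter records across partitions, but records coming from the same source partition tend to keep their original relative order inside the target partition. If you also want the order within each resulting partition randomized, the two techniques can be combined. A sketch of that idea follows; the helper name shuffleRDD is mine, not a Spark API:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

def shuffleRDD[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions { iter =>
    val rng = new scala.util.Random()
    iter.map((rng.nextInt, _))                             // tag every record with a random key
  }.partitionBy(new HashPartitioner(rdd.partitions.size))  // scatter records across partitions
   .values                                                 // drop the random keys
   .mapPartitions(iter =>                                  // then shuffle within each partition
     new scala.util.Random().shuffle(iter.toSeq).iterator)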
