
Is there a way to omit processing over an RDD partition with few elements in Spark?

I have an RDD and I need to apply a computation on each partition (using .mapPartitions), but only if the current partition of data has more than X elements.

Example: The number of elements in each partition of the RDD is:

80, 9, 0, 0, 0, 3, 60

I want to process only the partitions with more than 50 elements.

Is this even possible?

This can also be done lazily, without pre-calculating partition sizes. In this example, we filter to partitions with at least two elements:

import org.apache.spark.Partitioner

object DemoPartitioner extends Partitioner {
  override def numPartitions: Int = 3
  override def getPartition(key: Any): Int = key match {
    case num: Int => num
  }
}

sc.parallelize(Seq((0, "a"), (0, "a"), (0, "a"), (1, "b"), (2, "c"), (2, "c")))
  .partitionBy(DemoPartitioner) // create 3 partitions of sizes 3,1,2
  .mapPartitions { it =>
    // Peek at the first two elements; this consumes at most two items,
    // so the partition size is never fully counted
    val firstElements = it.take(2).toSeq
    if (firstElements.size < 2) {
      Iterator.empty // fewer than two elements: skip this partition
    } else {
      firstElements.iterator ++ it // re-prepend the peeked elements and pass through the rest lazily
    }
  }.foreach(println)

Output:

(2,c)
(2,c)
(0,a)
(0,a)
(0,a)

So partition 1, which had only a single element, was skipped.
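
To match the question's threshold of more than 50 elements, the same peek trick generalizes: take the first 51 items and process the partition only if they all exist. A minimal sketch; the helper name processLargePartitions and the compute parameter are illustrative placeholders, not part of the original answer:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Apply `compute` only to partitions holding more than `threshold` elements;
// smaller partitions yield nothing. At most threshold + 1 elements are
// materialized up front, so partition sizes are never pre-computed.
def processLargePartitions[T: ClassTag](rdd: RDD[T], threshold: Int = 50)
                                       (compute: Iterator[T] => Iterator[T]): RDD[T] =
  rdd.mapPartitions { it =>
    val head = it.take(threshold + 1).toSeq    // peek at up to threshold + 1 items
    if (head.size <= threshold) Iterator.empty // too few elements: skip this partition
    else compute(head.iterator ++ it)          // re-prepend the peeked items and compute
  }

Calling processLargePartitions(rdd)(myComputation) then behaves like the mapPartitions call above, but with the 50-element threshold from the question.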
