
Is there a way to omit processing over an RDD partition with few elements in Spark?

I have an RDD and I need to apply a computation on each partition (using .mapPartitions), but only if the current partition of data has more than X elements.

Example: The number of elements in each partition of the RDD is:

80, 9, 0, 0, 0, 3, 60

I want to process only the partitions with more than 50 elements.

Is this even possible?

This can also be done lazily, without pre-calculating partition sizes. In this example, we filter to partitions with at least two elements:

import org.apache.spark.Partitioner

object DemoPartitioner extends Partitioner {
  override def numPartitions: Int = 3
  override def getPartition(key: Any): Int = key match {
    case num: Int => num
  }
}

sc.parallelize(Seq((0, "a"), (0, "a"), (0, "a"), (1, "b"), (2, "c"), (2, "c")))
  .partitionBy(DemoPartitioner) // create 3 partitions of sizes 3,1,2
  .mapPartitions { it =>
    // Peek at the first two elements; this consumes at most two items,
    // so the partition size is never fully counted
    val firstElements = it.take(2).toSeq
    if (firstElements.size < 2) {
      Iterator.empty // fewer than two elements: skip this partition
    } else {
      firstElements.iterator ++ it // re-prepend the peeked elements and pass through the rest lazily
    }
  }.foreach(println)

Output:

(2,c)
(2,c)
(0,a)
(0,a)
(0,a)

So partition 1, which had only a single element, was skipped.
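
To match the question's threshold of more than 50 elements, the same peek trick generalizes: take the first 51 items and process the partition only if they all exist. A minimal sketch; the helper name processLargePartitions and the compute parameter are illustrative placeholders, not part of the original answer:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Apply `compute` only to partitions holding more than `threshold` elements;
// smaller partitions yield nothing. At most threshold + 1 elements are
// materialized up front, so partition sizes are never pre-computed.
def processLargePartitions[T: ClassTag](rdd: RDD[T], threshold: Int = 50)
                                       (compute: Iterator[T] => Iterator[T]): RDD[T] =
  rdd.mapPartitions { it =>
    val head = it.take(threshold + 1).toSeq    // peek at up to threshold + 1 items
    if (head.size <= threshold) Iterator.empty // too few elements: skip this partition
    else compute(head.iterator ++ it)          // re-prepend the peeked items and compute
  }

Calling processLargePartitions(rdd)(myComputation) then behaves like the mapPartitions call above, but with the 50-element threshold from the question.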
