Is there a way to omit processing over an RDD partition with few elements in Spark?
I have an RDD and I need to apply a computation on each partition (using .mapPartitions), but only if the current partition of data has more than X elements.
Example: The number of elements in each partition of the RDD is:
80, 9, 0, 0, 0, 3, 60
I want to process only the partitions with more than 50 elements.
Is this even possible?
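For reference, the eager route is to pre-compute the size of every partition and then filter by partition index. A minimal sketch of that approach (the helper name keepLargePartitions is hypothetical, not from the question):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Pre-computes each partition's size in one extra pass, then keeps
// only the partitions that contain at least `minSize` elements.
def keepLargePartitions[T: ClassTag](rdd: RDD[T], minSize: Int): RDD[T] = {
  val sizes = rdd
    .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
    .collectAsMap() // small map: one (index, size) entry per partition
  rdd.mapPartitionsWithIndex { (idx, it) =>
    if (sizes.getOrElse(idx, 0) >= minSize) it else Iterator.empty
  }
}

Since the RDD is traversed twice, it is worth caching it before calling this. The answer below avoids the extra counting pass entirely.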
This can also be done lazily, without pre-calculating partition sizes. In this example we filter to partitions with at least two elements:
import org.apache.spark.Partitioner

// A Partitioner that routes each pair to the partition given by its integer key
object DemoPartitioner extends Partitioner {
  override def numPartitions: Int = 3
  override def getPartition(key: Any): Int = key match {
    case num: Int => num
  }
}

sc.parallelize(Seq((0, "a"), (0, "a"), (0, "a"), (1, "b"), (2, "c"), (2, "c")))
  .partitionBy(DemoPartitioner) // create 3 partitions of sizes 3, 1, and 2
  .mapPartitions { it =>
    // take(2) pulls up to two elements and advances `it` past them
    val firstElements = it.take(2).toSeq
    if (firstElements.size < 2) {
      Iterator.empty // fewer than two elements: skip this partition entirely
    } else {
      firstElements.iterator ++ it // replay the buffered head, then the rest
    }
  }.foreach(println)
Output:
(2,c)
(2,c)
(0,a)
(0,a)
(0,a)
So partition 1, which has just a single element, was skipped.
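The same lazy trick generalizes to the question's 50-element cutoff. A minimal sketch, where withMinPartitionSize is a hypothetical helper (pass 51 to require strictly more than 50 elements):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Keeps only the partitions that contain at least `minSize` elements,
// buffering at most `minSize` elements per partition in memory.
def withMinPartitionSize[T: ClassTag](rdd: RDD[T], minSize: Int): RDD[T] =
  rdd.mapPartitions { it =>
    val head = it.take(minSize).toSeq // advances `it` past the buffered head
    if (head.size < minSize) Iterator.empty // too small: emit nothing
    else head.iterator ++ it // large enough: replay the head, then the rest
  }

Calling withMinPartitionSize(rdd, 51) then keeps only the partitions with more than 50 elements, without a separate counting pass.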