Spark foreachPartition: how to get the index of each partition?
With Spark's foreachPartition, how can I get the index of the partition (or a sequence number, or anything else that identifies the partition)?
val docs: RDD[String] = ...
println("num partitions: " + docs.getNumPartitions)
docs.foreachPartition((it: Iterator[String]) => {
  println("partition index: " + ???)
  it.foreach(...)
})
You can use TaskContext (see "How to get ID of a map task in Spark?"):
import org.apache.spark.TaskContext

rdd.foreachPartition((it: Iterator[String]) => {
  // ID of the partition handled by the task that runs this closure
  println(TaskContext.getPartitionId())
})
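For reference, here is a self-contained sketch of the TaskContext approach. It assumes a local SparkContext; the app name, the `local[2]` master and the sample strings are placeholders, not from the question. Keep in mind that println runs on the executors, so on a cluster the output ends up in the executor logs rather than on the driver console.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

object PartitionIdExample {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, for illustration only.
    val conf = new SparkConf().setAppName("partition-id-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data spread over 3 partitions.
      val docs: RDD[String] = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), numSlices = 3)
      println("num partitions: " + docs.getNumPartitions)

      docs.foreachPartition { (it: Iterator[String]) =>
        // Partition ID of the task that is currently running this closure.
        val idx = TaskContext.getPartitionId()
        it.foreach(doc => println(s"partition $idx: $doc"))
      }
    } finally {
      sc.stop()
    }
  }
}
```

The order in which partitions print is not deterministic, since the tasks run in parallel.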
It is not exactly identical, but you can use RDD.mapPartitionsWithIndex and return an Iterator[Unit] as the result:
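val rdd: RDD[Unit] = docs.mapPartitionsWithIndex { case (idx, it) =>
  // idx is the index of the partition being processed
  println("partition index: " + idx)
  it.foreach(...)
  // foreach returns Unit, so hand back an (empty) iterator to satisfy the signature
  Iterator.empty
}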
But then you have to remember to materialize the RDD, for example by calling an action such as count() on it, because transformations are evaluated lazily.
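A minimal sketch of what materializing looks like in practice, assuming a local SparkContext and hypothetical sample data (neither is from the original answer). Nothing is printed until the action at the end runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapPartitionsWithIndexExample {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setAppName("map-partitions-with-index").setMaster("local[2]"))
    try {
      // Hypothetical sample data spread over 2 partitions.
      val docs: RDD[String] = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

      val sideEffects: RDD[Unit] = docs.mapPartitionsWithIndex { case (idx, it) =>
        // Iterator.map is lazy: the println only runs when the iterator is consumed.
        it.map(doc => println(s"partition $idx: $doc"))
      }

      // count() is an action; it consumes every partition's iterator
      // and thereby triggers the side effects above.
      sideEffects.count()
    } finally {
      sc.stop()
    }
  }
}
```

Without the call to count() (or some other action), the job is never submitted and the closure never executes.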
An alternative would be to use mapPartitionsWithIndex for the logic that transforms the data, and then use foreachRDD just to send the data externally.
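A rough sketch of that split, under some assumptions: foreachRDD belongs to the Spark Streaming DStream API, so for a plain RDD this example substitutes foreachPartition for the sending step. The `send` helper, the upper-casing transformation and the sample data are all hypothetical stand-ins.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransformThenSendExample {
  // Hypothetical sink; a real job would write to a database, queue, etc.
  def send(partitionIndex: Int, doc: String): Unit =
    println(s"sending from partition $partitionIndex: $doc")

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setAppName("transform-then-send").setMaster("local[2]"))
    try {
      val docs: RDD[String] = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

      // Step 1: pure transformation that tags each record with its partition index.
      val tagged: RDD[(Int, String)] = docs.mapPartitionsWithIndex { (idx, it) =>
        it.map(doc => (idx, doc.toUpperCase))
      }

      // Step 2: an action whose only job is to push the records out.
      tagged.foreachPartition { it =>
        it.foreach { case (idx, doc) => send(idx, doc) }
      }
    } finally {
      sc.stop()
    }
  }
}
```

Because the index travels with each record, the sending step no longer needs to know which partition it is running in.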