
spark foreachPartition, how to get an index of each partition?

In spark foreachPartition, how do I get the index of the partition (or a sequence number, or anything else that identifies the partition)?

val docs: RDD[String] = ...

println("num partitions: " + docs.getNumPartitions)

docs.foreachPartition((it: Iterator[String]) => {
  println("partition index: " + ???)
  it.foreach(...)
})

You can use TaskContext (see How to get ID of a map task in Spark?):

import org.apache.spark.TaskContext

rdd.foreachPartition((it: Iterator[String]) => {
  println(TaskContext.getPartitionId)
})
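
For example, a minimal sketch (assuming the docs: RDD[String] from the question) that tags each element with the partition it came from:

import org.apache.spark.TaskContext

docs.foreachPartition { (it: Iterator[String]) =>
  // TaskContext.getPartitionId returns the index of the partition
  // that the current task is processing.
  val idx = TaskContext.getPartitionId()
  it.foreach(doc => println(s"partition $idx: $doc"))
}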

Not exactly identical, but you can use RDD.mapPartitionsWithIndex and return an Iterator[Unit] as the result:

val rdd: RDD[Unit] = docs.mapPartitionsWithIndex { case (idx, it) =>
  println("partition index: " + idx)
  it.map(doc => ...) // map, not foreach, so the block yields an Iterator[Unit]
}

But then you have to remember to materialize the RDD: mapPartitionsWithIndex is a lazy transformation, so nothing runs until an action is called.
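
A minimal sketch of forcing that materialization, assuming the rdd: RDD[Unit] built above; any action will do, count being a cheap conventional choice:

// mapPartitionsWithIndex is lazy; trigger execution with an action.
rdd.count()

// or, equivalently, run the RDD purely for its side effects:
rdd.foreach(_ => ())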

An alternative would be to use mapPartitionsWithIndex to do the logic related to transforming the data, and then use foreachPartition just to send the data externally.
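
A minimal sketch of that split, assuming the docs: RDD[String] from the question; sendToExternalSystem is a hypothetical stand-in for a real sink (database, queue, etc.), defined here only so the example is self-contained:

import org.apache.spark.rdd.RDD

// Hypothetical sink used for illustration.
def sendToExternalSystem(partition: Int, doc: String): Unit =
  println(s"sending from partition $partition: $doc")

// Transformation: keep the partition index alongside each (transformed) element.
val transformed: RDD[(Int, String)] = docs.mapPartitionsWithIndex { case (idx, it) =>
  it.map(doc => (idx, doc.trim)) // placeholder transformation
}

// Action: push each partition's data to the external system.
transformed.foreachPartition { it =>
  it.foreach { case (idx, doc) => sendToExternalSystem(idx, doc) }
}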
