
spark foreachPartition, how to get an index of each partition?

In spark foreachPartition, how do I get the index of the partition (or a sequence number, or anything else that identifies the partition)?

val docs: RDD[String] = ...

println("num partitions: " + docs.getNumPartitions)

docs.foreachPartition((it: Iterator[String]) => {
  println("partition index: " + ???)
  it.foreach(...)
})

You can use TaskContext (see How to get ID of a map task in Spark?):

import org.apache.spark.TaskContext

rdd.foreachPartition((it: Iterator[String]) => {
  println(TaskContext.getPartitionId)
})
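
For example, a minimal sketch (assuming the docs: RDD[String] from the question) that tags each element with the partition it came from:

import org.apache.spark.TaskContext

docs.foreachPartition { (it: Iterator[String]) =>
  // TaskContext.getPartitionId returns the index of the partition
  // that the current task is processing.
  val idx = TaskContext.getPartitionId()
  it.foreach(doc => println(s"partition $idx: $doc"))
}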

Not exactly identical, but you can use RDD.mapPartitionsWithIndex and return an Iterator[Unit] as the result:

val rdd: RDD[Unit] = docs.mapPartitionsWithIndex { case (idx, it) =>
  println("partition index: " + idx)
  it.map(doc => ...) // map, not foreach, so the block yields an Iterator[Unit]
}

But then you have to remember to materialize the RDD: mapPartitionsWithIndex is a lazy transformation, so nothing runs until an action is called.
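
A minimal sketch of forcing that materialization, assuming the rdd: RDD[Unit] built above; any action will do, count being a cheap conventional choice:

// mapPartitionsWithIndex is lazy; trigger execution with an action.
rdd.count()

// or, equivalently, run the RDD purely for its side effects:
rdd.foreach(_ => ())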

An alternative would be to use mapPartitionsWithIndex to do the logic related to transforming the data, and then use foreachPartition just to send the data externally.
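
A minimal sketch of that split, assuming the docs: RDD[String] from the question; sendToExternalSystem is a hypothetical stand-in for a real sink (database, queue, etc.), defined here only so the example is self-contained:

import org.apache.spark.rdd.RDD

// Hypothetical sink used for illustration.
def sendToExternalSystem(partition: Int, doc: String): Unit =
  println(s"sending from partition $partition: $doc")

// Transformation: keep the partition index alongside each (transformed) element.
val transformed: RDD[(Int, String)] = docs.mapPartitionsWithIndex { case (idx, it) =>
  it.map(doc => (idx, doc.trim)) // placeholder transformation
}

// Action: push each partition's data to the external system.
transformed.foreachPartition { it =>
  it.foreach { case (idx, doc) => sendToExternalSystem(idx, doc) }
}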
