Spark mapPartitionsWithIndex: identifying a partition

Question

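Identifying a partition:

mapPartitionsWithIndex(index, iter)

This method runs a function on each partition. I understand that the "index" parameter can be used to tell the partitions apart.

Many examples use this method with an "index = 0" condition to drop the header of a dataset. But how can we be sure that the first partition read (that is, the one whose "index" parameter equals 0) really is the one containing the header? Is the index assigned at random, or is it based on the partitioner, if one is used?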
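For reference, the header-dropping idiom I mean usually looks something like the sketch below. The file name data.csv and the names DropHeader and dropHeader are hypothetical; the sketch assumes a single input file whose first line is the header, read with sc.textFile on a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DropHeader {
  // Skip the first element of partition 0 only; every other partition passes through unchanged.
  def dropHeader(index: Int, iter: Iterator[String]): Iterator[String] =
    if (index == 0) iter.drop(1) else iter

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("drop-header").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // "data.csv" is a hypothetical single file whose first line is a header.
    val lines = sc.textFile("data.csv")
    val body = lines.mapPartitionsWithIndex((index, iter) => dropHeader(index, iter))

    body.take(5).foreach(println)
    sc.stop()
  }
}
```

The idiom is only correct if the header really sits at the start of partition 0, which is exactly what I am asking about.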

Answer 1

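Is it random, or is it based on the partitioner, if one is used?

It is not random. The index is the partition number, that is, the partition's position within the RDD. The simple example below shows this: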

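// 100 integers split across 4 partitions
val base = sc.parallelize(1 to 100, 4)
// Tag every element with the index of the partition that holds it
base.mapPartitionsWithIndex((index, iterator) => {
  iterator.map { x => (index, x) }
}).foreach { x => println(x) }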

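Result: (0,1) (1,26) (2,51) (1,27) (0,2) (0,3) (0,4) (1,28) (2,52) (1,29) (0,5) (1,30) (1,31) (2,53) (1,32) (0,6) ... ...

The lines come out interleaved because foreach runs on the partitions in parallel, but the pairing is stable: 1, 2, 3, ... always carry index 0, 26, 27, ... carry index 1, and 51, 52, ... carry index 2.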
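To see the assignment without the interleaved println output, here is a minimal sketch that collects one summary row per partition back to the driver. The names PartitionContents and summarize are made up for illustration, and it assumes a local master with the same 4-partition RDD as in the example above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionContents {
  // Reduce one partition to (partition index, first element, last element).
  // An empty partition produces no row.
  def summarize(index: Int, iter: Iterator[Int]): Iterator[(Int, Int, Int)] = {
    val xs = iter.toList
    if (xs.isEmpty) Iterator.empty else Iterator((index, xs.head, xs.last))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-contents").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val base = sc.parallelize(1 to 100, 4)
    // collect() returns the rows to the driver in partition order.
    val rows = base.mapPartitionsWithIndex((index, iter) => summarize(index, iter)).collect()

    // For 1 to 100 in 4 slices this prints (0,1,25) (1,26,50) (2,51,75) (3,76,100).
    rows.foreach(println)
    sc.stop()
  }
}
```

As far as I know, the same reasoning is why the "index = 0" trick works for a single file read with sc.textFile: partitions follow the order of the input splits, so the start of the file lands in partition 0. With several input files each file has its own header and only the first one falls in partition 0, and after a shuffle such as repartition the original order is no longer guaranteed, so drop the header before doing anything like that.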

Solution 1
5, accepted, 2017-06-12 15:55:36
