
rowNumber() over(partition_index) using mapPartitionsWithIndex in spark-shell

I'm trying to add the partition index and the row number within each partition to an RDD, and I did it. But when I tried to read the last row number afterwards I got zero; the rownums array seemed untouched. Is this a variable scope problem?

It's like rowNumber()/count() over(partition_index), but the row number is added along with the partition index in a single pass, so maybe it's more efficient?
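For reference, a rough sketch of the window-function form being compared against, using the DataFrame API (assuming the spark session provided by spark-shell; the pid and rownum column names are just illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, spark_partition_id}

// tag each row with its physical partition id, then number the rows within each partition
val df = spark.range(100, 111).repartition(3).withColumn("pid", spark_partition_id())
val w = Window.partitionBy("pid").orderBy("id")
df.withColumn("rownum", row_number().over(w)).show()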

Here is the code:

scala> val rdd1 = sc.makeRDD(100 to 110)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at makeRDD at <console>:25

scala> val rownums=new Array[Int](3)
rownums: Array[Int] = Array(0, 0, 0)

scala> val rdd2=rdd1.repartition(3).mapPartitionsWithIndex( (idx, itr) => itr.map(r => (idx, {rownums(idx)+=1;rownums(idx)}, r)) )
rdd2: org.apache.spark.rdd.RDD[(Int, Int, Int)] = MapPartitionsRDD[37] at mapPartitionsWithIndex at <console>:29

scala> rdd2.collect.foreach(println)
(0,1,100)
(0,2,107)
(0,3,104)
(0,4,105)
(0,5,106)
(0,6,110)
(1,1,102)
(1,2,108)
(1,3,103)
(2,1,101)
(2,2,109)

scala> //unaffected??

scala> rownums.foreach(println)
0
0
0

scala> rownums
res20: Array[Int] = Array(0, 0, 0)

I'm expecting (6, 3, 2) for rownums :(


Solved using an Accumulator:

scala> import org.apache.spark.util._
import org.apache.spark.util._

scala> val rownums=new Array[LongAccumulator](3)
rownums: Array[org.apache.spark.util.LongAccumulator] = Array(null, null, null)

scala> for(i <- 0 until rownums.length){rownums(i)=sc.longAccumulator("rownum_"+i)}

scala> val rdd1 = sc.makeRDD(100 to 110)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[92] at makeRDD at <console>:124

scala> val rownums2=new Array[Int](3)
rownums2: Array[Int] = Array(0, 0, 0)

scala> val rdd2=rdd1.repartition(3).mapPartitionsWithIndex( (idx, itr) => itr.map(r => (idx, {rownums2(idx)+=1;rownums(idx).add(1);rownums2(idx)}, r)) )
rdd2: org.apache.spark.rdd.RDD[(Int, Int, Int)] = MapPartitionsRDD[97] at mapPartitionsWithIndex at <console>:130

scala> rdd2.collect.foreach(println)
(0,1,107)                                                                       
(0,2,106)
(0,3,105)
(0,4,110)
(0,5,104)
(0,6,100)
(1,1,102)
(1,2,103)
(1,3,108)
(2,1,109)
(2,2,101)

scala> rownums.foreach(x=>println(x.value))
6
3
2

scala> 

Spark runs in a distributed system. That means your functions cannot modify variables that live outside them on the driver.

If you want to get an array with the count of each partition, you need to transform your RDD into an RDD[Int] where each row is the count of one partition, and then collect it.

rdd.mapPartitions(itr => Iterator(itr.size))

If the partition index is important, you can create an RDD[(Int, Int)] that includes it along with the row count.

rdd.mapPartitionsWithIndex((idx, itr) => Iterator((idx, itr.size)))
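
Putting that together, a minimal sketch (the counts variable name is illustrative) that collects the per-partition counts back to the driver and yields the (6, 3, 2) the question expects for the partitioning shown above:

val rdd1 = sc.makeRDD(100 to 110)
val counts = rdd1.repartition(3)
  .mapPartitionsWithIndex((idx, itr) => Iterator((idx, itr.size)))  // (partition index, row count)
  .collect()                                                        // Array[(Int, Int)] on the driver
  .sortBy(_._1)                                                     // order by partition index
  .map(_._2)                                                        // e.g. Array(6, 3, 2)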

Please read Understanding closures from the Spark programming guide:

Prior to execution, Spark computes the task's closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.

The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it's no longer the counter on the driver node. There is still a counter in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero since all operations on counter were referencing the value within the serialized closure.

You are modifying a local copy of the variable, not the original variable.
