
Explanation of the fold method of Spark RDD

I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like:

sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)

And it gave a surprising result: 97.0.

This is quite counter-intuitive compared to the Scala version of fold:

Array(2., 3.).fold(0.0)((p, v) => p+v*v)

which gives the expected answer, 13.0.

It seems quite likely that I have made some tricky mistake in the code due to a lack of understanding. I have read that the function used in RDD.fold() should be commutative, otherwise the result may depend on the partitioning, etc. For example, if I change the number of partitions to 1,

sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)

the code gives me 169.0 on my machine!

Can someone explain what exactly is happening here?

Well, it is actually pretty well explained by the official documentation:

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.

To illustrate what is going on, let's try to simulate it step by step:

val rdd = sc.parallelize(Array(2., 3.))

val byPartition = rdd.mapPartitions(
    iter => Array(iter.fold(0.0)((p, v) => (p + v * v))).toIterator).collect()

It gives us something similar to this: Array[Double] = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0), and

byPartition.reduce((p, v) => (p + v * v))

returns 97.0.
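The single-partition result of 169.0 from the question follows the same two-phase pattern and can be reproduced in plain Scala without Spark (a sketch of the arithmetic Spark performs, not its actual internals):

```scala
// Phase 1: fold inside the single partition:
// 0.0 + 2*2 = 4.0, then 4.0 + 3*3 = 13.0
val partitionResult = Array(2.0, 3.0).fold(0.0)((p, v) => p + v * v)

// Phase 2: the driver folds the per-partition results, starting again
// from the zero value and reusing the same non-commutative function:
val finalResult = Array(partitionResult).fold(0.0)((p, v) => p + v * v)
// 0.0 + 13.0 * 13.0 = 169.0
```

Because the zero value is applied once per partition and once more when merging, the squaring is effectively applied to an already-squared partial sum.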

An important thing to note is that the results can differ from run to run, depending on the order in which the partitions are combined.
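For example, merging the same per-partition results in reverse order yields a different answer (plain Scala, simulating only the merge step):

```scala
// Per-partition results from the simulation above:
val parts = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0)

// Merging in the original order vs. reversed order with the same
// non-commutative function gives different final results:
val forward  = parts.reduce((p, v) => p + v * v)          // 97.0
val backward = parts.reverse.reduce((p, v) => p + v * v)  // 25.0
```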
