Apache Spark RDD - not updating

I create a PairRDD which contains a Vector.

var newRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))

Later on I update the RDD:

newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)

However, although it outputs an updated Vector (as shown in the console), when I next call newRDD I can see that the Vector value has changed. Through testing I have concluded that it has changed to something given by math.random, as every time I call newRDD the Vector changes. I understand there is a lineage graph and maybe that has something to do with it. I need to update the Vector held in the RDD to new values, and I need to do this repeatedly.

Thanks.

RDDs are immutable structures meant to distribute operations on data over a cluster. There are two elements at play in the behavior you are observing here:

RDD lineage may be recomputed every time. In this case, it means that an action on newRDD might trigger the lineage computation, therefore applying the Vector(Array.fill(2){math.random}) transformation and resulting in new values each time. The lineage can be broken using cache, in which case the values produced by the transformation will be kept in memory and/or on disk after the first time it is applied. This results in:

val randomVectorRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
// Keep the materialized random vectors so later actions reuse them
// instead of recomputing the lineage (and drawing new random numbers).
randomVectorRDD.cache()
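
With the cache in place, repeated actions return the same values. A minimal check, assuming the randomVectorRDD from the snippet above (the Array values are converted to Seq only because Array equality in Scala is reference-based):

// The first action materializes the random vectors and populates the cache;
// the second one reads from the cache instead of re-running math.random.
val first  = randomVectorRDD.mapValues(_.head.toSeq).collect()
val second = randomVectorRDD.mapValues(_.head.toSeq).collect()
assert(first.sameElements(second))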

The second aspect that needs further consideration is the in-place mutation:

newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)

Although this might work on a single machine because all Vector references are local, it will not scale to a cluster: the references returned by lookup are serialized copies, so mutations to them will not be preserved. It therefore raises the question of why use Spark for this at all.

To be implemented on Spark, this algorithm will need to be re-designed so that it is expressed in terms of transformations rather than point lookups and mutations; a sketch of one such re-design follows.
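
As a rough illustration (not the original algorithm), one way to express the update from the question as a transformation is to precompute the per-user deltas — here 0.2 * errorRate(rating) * myVector, names taken from the question — into their own pair RDD and join it against the weights, producing a new RDD each iteration. The key type and the helper name applyUpdates below are assumptions for the sketch:

import org.apache.spark.rdd.RDD

// Hypothetical signature: keys are user ids, values are the weight arrays.
// `updates` is assumed to hold the precomputed delta for each user to update.
def applyUpdates(
    weights: RDD[(String, Array[Double])],
    updates: RDD[(String, Array[Double])]): RDD[(String, Array[Double])] = {
  weights.leftOuterJoin(updates).mapValues {
    case (w, Some(delta)) => w.zip(delta).map { case (a, b) => a + b }
    case (w, None)        => w  // users without an update keep their weights
  }
}

Each iteration returns a new, immutable RDD; caching it (and unpersisting the previous one) keeps repeated passes from recomputing the whole lineage.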
