
Apache Spark RDD - not updating

I create a PairRDD which contains a Vector.

var newRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))

Later on I update the RDD:

newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)

This prints an updated Vector in the console, but the next time I evaluate newRDD the Vector holds different values. Through testing I have concluded that the new values come from math.random, because the Vector changes every time I evaluate newRDD. I understand there is a lineage graph, and maybe that has something to do with it. I need to update the Vector held in the RDD with new values, and I need to do this repeatedly.

Thanks.

RDDs are immutable structures meant to distribute operations on data over a cluster. Two things contribute to the behavior you are observing:

The RDD lineage may be recomputed on every action. Here, an action on newRDD can trigger recomputation of the lineage, re-applying the Vector(Array.fill(2){math.random}) transformation and producing new values each time. Recomputation can be avoided with cache, which keeps the result of the transformation in memory (or, with persist and a suitable storage level, on disk) once the first action has computed it. Caching does not remove the lineage: a partition that is evicted or lost is recomputed from it. This results in:

val randomVectorRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
randomVectorRDD.cache()
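Since cache() is lazy and cached partitions can still be recomputed, a more robust variant is to make the random initialization deterministic per key. The following is only a sketch under assumptions: it uses plain Array[Double] in place of the Vector class, Int user ids, a made-up seed, and a local master; the names CachedRandomVectors and initialVector are hypothetical and not from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import scala.util.Random

object CachedRandomVectors {
  // Hypothetical helper: the same (userId, seed) pair always yields the same
  // vector, so a recomputed partition matches the one computed earlier.
  def initialVector(userId: Int, seed: Long, dim: Int = 2): Array[Double] = {
    val rng = new Random(seed + userId)
    Array.fill(dim)(rng.nextDouble())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cached-vectors").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Stand-in for oldRDD: user id -> list of (item, rating) pairs.
    val oldRDD = sc.parallelize(Seq(1 -> Seq((10, 4.0)), 2 -> Seq((11, 3.0))))

    val vectors = oldRDD
      .map { case (user, _) => user -> initialVector(user, seed = 42L) }
      .persist(StorageLevel.MEMORY_AND_DISK)

    // cache/persist only mark the RDD; the first action materializes it.
    vectors.count()

    // Two separate lookups now see the same values.
    val first = vectors.lookup(1).head.toSeq
    val second = vectors.lookup(1).head.toSeq
    assert(first == second)

    sc.stop()
  }
}
```

With the seeded generator the values stay stable even if Spark has to rebuild a partition from the lineage, which cache alone does not guarantee.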

The second aspect that needs further consideration is the in-place mutation:

newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)

Although this might work on a single machine because all Vector references are local, it will not scale to a cluster: the values returned by lookup are serialized copies, so mutations made to them are not preserved. That raises the question of why Spark is being used for this.

To be implemented on Spark, this algorithm needs to be redesigned so that it is expressed in terms of transformations rather than point lookups and mutations.
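One possible shape for such a redesign is sketched below: collect the per-user corrections as their own pair RDD, join them with the current vectors, and produce a new RDD on each iteration. This is a sketch under assumptions, not the original algorithm: Array[Double] stands in for Vector, the deltas are made-up sample data standing in for errorRate(rating) * myVector, and the names VectorUpdateSketch, add, scale and step are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object VectorUpdateSketch {
  type Vec = Array[Double]

  // Element-wise helpers that return new arrays instead of mutating.
  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def scale(v: Vec, k: Double): Vec = v.map(_ * k)

  // One update step: builds a NEW RDD and leaves the input RDD untouched.
  def step(vectors: RDD[(Int, Vec)],
           deltas: RDD[(Int, Vec)],
           learningRate: Double): RDD[(Int, Vec)] = {
    val summed = deltas.reduceByKey(add) // combine all corrections per user
    vectors.leftOuterJoin(summed).mapValues {
      case (v, Some(d)) => add(v, scale(d, learningRate))
      case (v, None)    => v // users without a correction keep their vector
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("vector-update").setMaster("local[*]")
    val sc = new SparkContext(conf)

    var vectors: RDD[(Int, Vec)] =
      sc.parallelize(Seq(1 -> Array(0.5, 0.5), 2 -> Array(1.0, 1.0))).cache()

    for (_ <- 1 to 3) {
      // Assumed sample correction for user 1 only.
      val deltas = sc.parallelize(Seq(1 -> Array(1.0, 2.0)))
      val updated = step(vectors, deltas, learningRate = 0.2).cache()
      updated.count()     // materialize before dropping the previous version
      vectors.unpersist()
      vectors = updated   // rebind the variable; the old RDD was never mutated
    }

    vectors.collect().foreach { case (u, v) => println(s"$u -> ${v.mkString(", ")}") }
    sc.stop()
  }
}
```

Rebinding the var to the newly produced RDD replaces the in-place += on the looked-up value. For loops with many iterations the lineage keeps growing, so periodically calling checkpoint() (after sc.setCheckpointDir) is worth considering to truncate it.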

