
Modifying an RDD of objects in Spark (Scala)

I have:

val rdd1: RDD[myClass]

It has been initialized; I checked while debugging and all the members have their default values.

If I do

rdd1.foreach(x=>x.modifier())

where modifier is a member function of myClass that modifies some of the member variables.

After executing this, if I check the values inside the RDD, they have not been modified.

Can someone explain what's going on here? And is it possible to make sure the values are modified inside the RDD?

EDIT:

import scala.collection.mutable.Buffer

class myClass(var id: String, var sessions: Buffer[Long], var avgsession: Long) {
    def calcAvg(): Unit = {
        // calculate avg by summing over sessions and dividing by length,
        // then store this average in avgsession
        avgsession = if (sessions.nonEmpty) sessions.sum / sessions.length else 0L
    }
}

The avgsession attribute is not updating if I do

myrdd.foreach(x=>x.calcAvg())

RDDs are immutable; calling a mutating method on the objects they contain will not have any effect.

The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instance:

case class MyClass(id: String, avgsession: Long) {
    def modifier(a: Int): MyClass =
        this.copy(avgsession = this.avgsession + a)
}

Now you still cannot update rdd1, but you can obtain an rdd2 that will contain the updated instances:

val rdd2 = rdd1.map(_.modifier(18))
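
For illustration, a rough end-to-end sketch of this copy-based approach, assuming the MyClass case class above and a SparkContext sc (e.g. from spark-shell); the sample ids and values are made up:

val rdd1 = sc.parallelize(Seq(MyClass("a", 10L), MyClass("b", 20L)))
val rdd2 = rdd1.map(_.modifier(18))

rdd2.collect().foreach(println)
// MyClass(a,28)
// MyClass(b,38)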

The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:

import scala.collection.mutable

val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_ += 1)
rdd.collect.foreach(println)

If you run that example, you will get Set() as the result, just as the original answer states.

However, if you run the exact same thing with a cache call first:

import scala.collection.mutable

val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_ += 1)
rdd.collect.foreach(println)

Now the result will print as Set(1). So it depends on whether the data is being cached in memory. If Spark is recomputing from source or reading from a serialized copy on disk, then it will always reset back to the original object and appear to be immutable; but if it is not loading from a serialized form, then the mutations will in fact stick.
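
To make the dependence on serialization concrete, here is a rough sketch (assuming a plain SparkContext sc, e.g. in spark-shell) that compares a deserialized in-memory cache with a serialized one; the exact output can vary with deployment mode, but typically:

import scala.collection.mutable
import org.apache.spark.storage.StorageLevel

// Deserialized cache: foreach mutates the same in-memory objects that
// collect later reads, so the mutation sticks.
val deser = sc.parallelize(Seq(mutable.HashSet[Int]()))
deser.persist(StorageLevel.MEMORY_ONLY)
deser.count()                      // materialize the cache first
deser.foreach(_ += 1)
deser.collect.foreach(println)     // typically Set(1)

// Serialized cache: every read deserializes fresh copies, so the
// mutation is lost again.
val ser = sc.parallelize(Seq(mutable.HashSet[Int]()))
ser.persist(StorageLevel.MEMORY_ONLY_SER)
ser.count()
ser.foreach(_ += 1)
ser.collect.foreach(println)       // typically Set()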

RDDs are immutable. By using map, you can iterate over the RDD and return a new one.

val rdd2 = rdd1.map { x => x.modifier(); x }  // modifier() returns Unit, so return x itself

I have observed that code like yours will work after calling RDD.persist when running on Spark on YARN. It is probably unsupported/accidental behavior and you should avoid it, but it is a workaround that may help in a pinch. I'm running Spark 1.5.0.
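
For reference, a hypothetical sketch of what that workaround looks like (the class mirrors the one in the question, plus Serializable so Spark can ship instances between driver and executors; sc is assumed, e.g. from spark-shell). Because it relies on mutating deserialized cached objects in place, it can silently stop working if blocks are evicted, spilled, or stored serialized:

import scala.collection.mutable.Buffer

class MyMutable(var id: String, var sessions: Buffer[Long], var avgsession: Long)
        extends Serializable {
    def calcAvg(): Unit = {
        avgsession = if (sessions.nonEmpty) sessions.sum / sessions.length else 0L
    }
}

val myrdd = sc.parallelize(Seq(new MyMutable("a", Buffer(2L, 4L), 0L)))
myrdd.persist()                    // default MEMORY_ONLY: deserialized objects
myrdd.count()                      // materialize the cache before mutating
myrdd.foreach(_.calcAvg())         // mutates the cached instances in place
myrdd.collect().foreach(x => println(x.avgsession))  // may print 3; do not rely on this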
