How to overwrite an RDD in a loop

I am very new to Spark and Scala, and I am implementing an iterative algorithm that manipulates a big graph. Assume that inside a for loop we have two RDDs (rdd1 and rdd2) whose values get updated, for example something like:

for (i <- 0 to 5) {
  val rdd1 = rdd2.someTransformations
  rdd2 = rdd1
}

So basically, during iteration i+1 the value of rdd1 is computed based on its value at iteration i. I know that RDDs are immutable, so I cannot really reassign anything to them, but I just wanted to know whether what I have in mind is possible to implement. If so, how? Any help is greatly appreciated.

Thanks,


Update: when I try this code:

var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))

for(i <- 0 to 5){
    var size2 = size2.map(y=> readyForExpandFunc(y))
}
size2.collect()

it gives me this error: "recursive variable size2 needs type". I am not sure what it means.

Just open a spark-shell and try it:

scala> var rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> for( i <- 0 to 5 ) { rdd1 = rdd1.map( _ + 1 ) }

scala> rdd1.collect()
res1: Array[Int] = Array(7, 8, 9, 10, 11)                                       

as you can see, it works.
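Applied to the snippet from the update, the "recursive variable size2 needs type" error most likely comes from the second var inside the loop: var size2 = size2.map(...) declares a new variable whose initializer refers to itself. A minimal sketch of the fix, keeping the question's extendFunc and readyForExpandFunc, is to declare size2 once and only reassign it inside the loop:

var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))

for (i <- 0 to 5) {
  // reassign without a second `var`, so the name refers to the variable declared above
  size2 = size2.map(y => readyForExpandFunc(y))
}

size2.collect()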

Just for completeness: you can use foldRight to avoid a mutable var if you want your code to stay purely functional:

val zeroRdd = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))
val size2 = (0 to 5).foldRight(zeroRdd) {
  (_, rdd) => rdd.map(y => readyForExpandFunc(y))
}
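Since the loop index is never used, foldLeft over the same range is equivalent and reads left to right; a sketch under the same assumptions about extendFunc and readyForExpandFunc:

val zeroRdd = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))
val size2 = (0 to 5).foldLeft(zeroRdd) {
  // accumulator comes first in foldLeft; the range element is ignored
  (rdd, _) => rdd.map(y => readyForExpandFunc(y))
}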

The way you access data in an RDD depends on its structure. If you want to perform some calculation on each item individually, you can use map directly:

val intRDD = spark.sparkContext.parallelize(Seq(1,2,3,4,5))
val multipliedBy10 = intRDD.map(myInteger=>myInteger*10)
print(multipliedBy10.collect.toList) // output: List(10, 20, 30, 40, 50)

If your RDD contains multiple values per item (i.e. a tuple), you can do:

val tupleRDD = spark.sparkContext.parallelize(Seq(('A', 1), ('B', 2), ('C', 3)))
val concatTuple = tupleRDD.map(tuple=>tuple._1 + "-" + tuple._2)
print(concatTuple.collect.toList) // output: List(A-1, B-2, C-3)

If you also need data from another RDD for your calculations, I would recommend joining the two RDDs first.
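A minimal sketch of such a join, reusing tupleRDD from above together with a hypothetical namesRDD keyed by the same Char values:

val namesRDD = spark.sparkContext.parallelize(Seq(('A', "ant"), ('B', "bee"), ('C', "cat")))
// join pairs up values that share a key: RDD[(Char, (Int, String))]
val joined = tupleRDD.join(namesRDD)
val described = joined.map { case (key, (num, name)) => s"$key: $num $name" }
print(described.collect.toList) // output (order may vary): List(A: 1 ant, B: 2 bee, C: 3 cat)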
