How to overwrite an RDD in a loop
I am very new to Spark and Scala, and I am implementing an iterative algorithm that manipulates a big graph. Assume that inside a for loop we have two RDDs (rdd1 and rdd2) and their values get updated. For example, something like:
for (i <- 0 to 5) {
  val rdd1 = rdd2.someTransformations
  rdd2 = rdd1
}
So basically, during iteration i+1 the value of rdd1 is computed based on its value at iteration i. I know that RDDs are immutable, so I cannot really reassign anything to them, but I just wanted to know whether what I have in mind is possible to implement. If so, how? Any help is greatly appreciated.
Thanks,
Update: when I try this code:
var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))
for (i <- 0 to 5) {
  var size2 = size2.map(y => readyForExpandFunc(y))
}
size2.collect()
it is giving me this error: "recursive variable size2 needs type". I am not sure what it means.
Just open a spark-shell and try it:
scala> var rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> for( i <- 0 to 5 ) { rdd1 = rdd1.map( _ + 1 ) }
scala> rdd1.collect()
res1: Array[Int] = Array(7, 8, 9, 10, 11)
As you can see, it works.
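Applied to the code from the update, the "recursive variable size2 needs type" error comes from re-declaring size2 with var inside the loop, which makes the new variable's definition refer to itself. A sketch of the fix, reusing the identifiers from the question (untested against the real data):

```scala
// Declare the var once, outside the loop...
var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))
for (i <- 0 to 5) {
  // ...and only reassign here: no `var`, so there is no
  // self-referential declaration and no "recursive variable" error.
  size2 = size2.map(y => readyForExpandFunc(y))
}
size2.collect()
```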
Just for completeness, you can use foldRight to avoid using a mutable var, if you want your code to be more purely idiomatic:
val zeroRdd = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))
val size2 = (0 to 5).foldRight(zeroRdd) {
  (_, rdd) => rdd.map(y => readyForExpandFunc(y))
}
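The same fold pattern can be tried without Spark; a minimal plain-Scala sketch mirroring the spark-shell example above:

```scala
// Apply the "add 1" step once per element of the range (0 to 5 = 6 steps),
// threading the intermediate list through the fold instead of a mutable var.
val result = (0 to 5).foldLeft(List(1, 2, 3, 4, 5)) {
  (list, _) => list.map(_ + 1)
}
// result: List(7, 8, 9, 10, 11), matching the var-based loop above
```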
The way you can access data on an RDD will depend on its structure. If you want to perform some calculations with the data you have in a single item, you can use map directly:
val intRDD = spark.sparkContext.parallelize(Seq(1,2,3,4,5))
val multipliedBy10 = intRDD.map(myInteger=>myInteger*10)
print(multipliedBy10.collect.toList) // output: List(10, 20, 30, 40, 50)
If your RDD contains multiple values (i.e. a tuple), you can do:
val tupleRDD = spark.sparkContext.parallelize(Seq(('A', 1), ('B', 2), ('C', 3)))
val concatTuple = tupleRDD.map(tuple=>tuple._1 + "-" + tuple._2)
print(concatTuple.collect.toList) // output: List(A-1, B-2, C-3)
If you also need data from another RDD to do your calculations, I would recommend first joining both RDDs.
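A minimal sketch of that (the RDD contents here are made up for illustration; `spark` is an active SparkSession as in the examples above):

```scala
// Two pair-RDDs keyed by name; join them, then compute on the combined values.
val salaries = spark.sparkContext.parallelize(Seq(("alice", 100), ("bob", 200)))
val bonuses  = spark.sparkContext.parallelize(Seq(("alice", 10), ("bob", 20)))
val totals = salaries.join(bonuses) // RDD[(String, (Int, Int))]
  .map { case (name, (salary, bonus)) => (name, salary + bonus) }
print(totals.collect.toList) // order of keys is not guaranteed
```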