
Modifying RDD of object in Spark (Scala)

I have:

val rdd1: RDD[myClass]

It has been initialized; I checked while debugging that all the members have their default values.

If I do:

rdd1.foreach(x=>x.modifier())

where modifier is a member function of myClass that modifies some of the member variables.

After executing this, if I check the values inside the RDD, they have not been modified.

Can someone explain what's going on here? And is it possible to make sure the values are modified inside the RDD?

EDIT:

import scala.collection.mutable.Buffer

class myClass(var id: String, var sessions: Buffer[Long], var avgsession: Long) {
    def calcAvg(): Unit = {
        // calculate avg by summing over sessions and dividing by length,
        // and store this average in avgsession
        avgsession = sessions.sum / sessions.length
    }
}

The avgsession attribute is not updating if I do:

myrdd.foreach(x=>x.calcAvg())

RDDs are immutable; calling a mutating method on the objects they contain will not have any effect.

The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instances in place:

case class MyClass(id: String, avgsession: Long) {
    def modifier(a: Int): MyClass =
        this.copy(avgsession = this.avgsession + a)
}

Now you still cannot update rdd1, but you can obtain rdd2, which will contain the updated instances:

val rdd2 = rdd1.map(_.modifier(18))
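
For completeness, here is a minimal, self-contained sketch of this copy-based approach, reusing the MyClass case class above; the SparkContext setup and sample values are illustrative assumptions, not part of the original answer:

import org.apache.spark.{SparkConf, SparkContext}

object CopyDemo {
    def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("copy-demo").setMaster("local[*]"))
        val rdd1 = sc.parallelize(Seq(MyClass("a", 10L), MyClass("b", 20L)))
        // map returns a new RDD of new instances; rdd1 itself is left untouched
        val rdd2 = rdd1.map(_.modifier(18))
        rdd2.collect().foreach(println)  // prints MyClass(a,28) and MyClass(b,38)
        sc.stop()
    }
}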

The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well, and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:

import scala.collection.mutable

val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_ += 1)
rdd.collect.foreach(println)

If you run that example you will get Set() as the result, just like the original answer states.

However, if you were to run the exact same thing with a cache call:

val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_ += 1)
rdd.collect.foreach(println)

Now the result will print as Set(1). So it depends on whether the data is being cached in memory. If Spark is recomputing from source or reading from a serialized copy on disk, then it will always reset back to the original object and appear to be immutable; but if it is not loading from a serialized form, then the mutations will in fact stick.
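
To see that serialization dependence concretely, here is a hedged sketch using the standard StorageLevel API; the commented results are what you would likely observe, not a guarantee, since this behavior is version- and deployment-dependent:

import org.apache.spark.storage.StorageLevel
import scala.collection.mutable

val deser = sc.parallelize(Seq(new mutable.HashSet[Int]()))
deser.persist(StorageLevel.MEMORY_ONLY)       // cached as live deserialized objects
deser.foreach(_ += 1)
deser.collect.foreach(println)                // likely Set(1): the cached object itself was mutated

val ser = sc.parallelize(Seq(new mutable.HashSet[Int]()))
ser.persist(StorageLevel.MEMORY_ONLY_SER)     // cached in serialized form
ser.foreach(_ += 1)
ser.collect.foreach(println)                  // likely Set(): each read deserializes a fresh copy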

Objects are immutable. By using map, you can iterate over the RDD and return a new one.

val rdd2 = rdd1.map(x => x.modifier())

I have observed that code like yours will work after calling RDD.persist when running in Spark/YARN. It is probably unsupported/accidental behavior and you should avoid it, but it is a workaround that may help in a pinch. I'm running version 1.5.0.
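
A minimal sketch of that workaround applied to the myClass from the question (the sample values are illustrative, myClass is assumed to be Serializable so Spark can ship it to executors, and since the behavior is accidental, the mutation may not survive cache eviction or recomputation):

import scala.collection.mutable.Buffer

val myrdd = sc.parallelize(Seq(new myClass("u1", Buffer(2L, 4L), 0L)))
myrdd.persist()                 // keep deserialized objects cached in memory
myrdd.foreach(_.calcAvg())      // mutates the cached instances in place
myrdd.collect.foreach(x => println(x.avgsession))  // may print 3 while the cache holds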
