Why does this Spark code work in local mode but not in cluster mode?
So, I have something like this. Note that baseTrait (a trait) here is serializable, and therefore thisClass (an object) should also be serializable.
object thisClass extends baseTrait {
  private var someVar: String = null

  def someFunc: RDD[...] = {
    ...
    // assigned some string value or an empty string value (not null anymore)
    someVar = ...
    ...
    if (someVar != "")
      someRDD.filter(x => aFunc(x, someVar))
    else
      ...
  }
}
In cluster mode, when I call the someFunc function (which is effectively a static method, since thisClass is an object), I get a NullPointerException, which I think has to do with someVar not being serialized properly. Because when I do this instead, it works perfectly in cluster mode:
if (someVar != "") {
  val someVar_ = someVar
  someRDD.filter(x => aFunc(x, someVar_))
}
Any idea what was going wrong in the original code, given that thisClass is serializable in the first place?
My guess is that it's fine to use a variable of a serializable class from within another class, but if you try to do it inside that same class, you can have problems, because in that case the runtime would be trying to serialize the very class the closure is being called from. What do you think?
You are not experiencing a serialization problem in this case.
Basically, what happens in cluster mode is that thisClass.someFunc is never actually executed in the remote executor's JVM. On the executor, thisClass is instantiated, and someVar is assigned null. Then, while the thisClass object is in that state, the Spark framework executes your lambda function directly on the records available in that executor's partition of the data.
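The difference between the two versions of the filter can be sketched without Spark at all: a closure that names an object field re-reads that field every time it is called, while a closure built over a local copy keeps the value it captured. ThisClassDemo and the string values below are hypothetical stand-ins for thisClass and someVar:

```scala
// Hypothetical stand-in for thisClass: a singleton with a mutable field.
object ThisClassDemo {
  var someVar: String = null // what a freshly instantiated executor JVM sees

  // Reads the object field by reference, at call time.
  def fieldClosure: String => Boolean = x => x.contains(someVar)

  // Captures a local copy of the field's current value.
  def copyClosure: String => Boolean = {
    val someVar_ = someVar
    x => x.contains(someVar_)
  }
}

// On the "driver", someVar is assigned before the closures are built.
ThisClassDemo.someVar = "spark"
val byField = ThisClassDemo.fieldClosure
val byCopy  = ThisClassDemo.copyClosure

// Mimic a fresh executor JVM, where someFunc never ran and the field is null.
ThisClassDemo.someVar = null

val copied = byCopy("a spark job") // true: the captured copy still holds "spark"
val crashed =
  try { byField("a spark job"); false }
  catch { case _: NullPointerException => true } // re-reads the now-null field
```

This is exactly the shape of the working workaround above: `val someVar_ = someVar` freezes the value into the closure instead of leaving a live reference to the object field.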
A way to avoid this is to move the assignment of someVar into the body of the thisClass object. Doing that will assign the value immediately when the object is instantiated. Bear in mind that this code will then be executed on every executor in the cluster.
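A minimal sketch of that shape; the object name and the initializer expression here are made-up placeholders, not the original code:

```scala
// Hypothetical sketch: the assignment lives in the object body, so it runs
// during object construction, i.e. on first access in every JVM --
// driver and executors alike -- before any lambda can touch the field.
object ThisClassInit {
  private val someVar: String =
    sys.env.getOrElse("SOME_VAR", "") // placeholder initializer

  def currentVar: String = someVar
}

val v = ThisClassInit.currentVar // never null: set at construction time
```

Because the field is now a val assigned at construction, a closure that references it can no longer observe the uninitialized null state, regardless of which JVM evaluates it.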
If that is not possible, another option would be to map your RDD[T] to RDD[(T, String)], where the string is the value of someVar for every record; then your filter could be something like .filter(x => aFunc(x._1, x._2)). This method will use more memory, as you'll have many copies of someVar's value.
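A sketch of that pairing, using a plain Seq to stand in for the RDD (map and filter have the same shape on both) and a substring check as a hypothetical stand-in for aFunc:

```scala
// Seq stands in for RDD here; the map/filter shapes are identical in Spark.
val someVar = "needle"
val aFunc = (x: String, v: String) => x.contains(v) // hypothetical predicate

val someRDD: Seq[String] = Seq("a needle here", "plain hay")

// Pair every record with someVar's value, so the filter closure only
// references its own arguments, never the enclosing object's field.
val paired: Seq[(String, String)] = someRDD.map(x => (x, someVar))
val kept: Seq[String] = paired.filter(x => aFunc(x._1, x._2)).map(_._1)
```

The memory cost mentioned above is visible here: every element of `paired` carries its own reference to the string, which on a real cluster means the value is shipped with each record rather than once per task.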