Why does this Spark code work in local mode but not in cluster mode?

So, I have something like this. Note that baseTrait (a trait) here is serializable, and therefore thisClass (a Scala object) should also be serializable.

object thisClass extends baseTrait {
  private var someVar: String = null

  def someFunc: RDD[...] = {
    ...
    // assigned some string value or an empty string value (not null anymore)
    someVar = ...
    ...
    if (someVar != "")
      someRDD.filter(x => aFunc(x, someVar))
    else
      ...
  }
}

In cluster mode, when I call the someFunc function (which is effectively a static method, since thisClass is a Scala object), I get a null pointer exception, which I think has to do with someVar not being serialized properly. I say that because when I do the following instead, it works perfectly in cluster mode.

if (someVar != "") {
  val someVar_ = someVar
  someRDD.filter(x => aFunc(x, someVar_))
}

Any idea what was going wrong in the original code, given that thisClass is serializable in the first place?

My guess is that it's fine to use a variable of a serializable class from within another class, but if you try to do it inside that same class, you can have problems, as in that case the runtime would be trying to serialize the very class the closure is called from. What do you think?

You are not experiencing a serialization problem in this case.

Basically, what happens in cluster mode is that thisClass.someFunc is never actually executed in the remote executor's JVM. On the executor, thisClass is instantiated, and someVar is assigned null. Then, while the thisClass object is in that state, the Spark framework executes your lambda function directly on the records available in that executor's partition of the data.
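
To make this concrete, here is a hedged, self-contained sketch of the failure mode (all names and literal values here are illustrative, not from the question). In local mode the driver and executors share one JVM, so the assignment made before the job runs is visible to the closure; on a real cluster, each executor initializes its own copy of the object with someVar still null, and the predicate throws.

import org.apache.spark.sql.SparkSession

object Repro {
  private var someVar: String = null

  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder.appName("repro").getOrCreate().sparkContext
    someVar = "keep" // runs on the driver only
    sc.parallelize(Seq("keep", "drop"))
      .filter(x => someVar.equals(x)) // each executor reads its own someVar: null -> NPE
      .collect()
      .foreach(println)
  }
}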

A way to avoid this is to move the assignment to someVar into the body of the thisClass object. Doing that will assign the value to someVar immediately when the object is instantiated. Bear in mind that this code will be executed on every executor in the cluster.
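
A minimal sketch of this first option, assuming the value can be computed without running someFunc (the stub trait and the helper computeSomeVar are stand-ins for the question's code, not part of it):

trait baseTrait extends Serializable

object thisClass extends baseTrait {
  // Runs during object instantiation, i.e. on the driver and again on
  // every executor, so the closure sees an initialized value everywhere.
  private val someVar: String = computeSomeVar()

  // Hypothetical helper standing in for however the value is really derived.
  private def computeSomeVar(): String = sys.env.getOrElse("SOME_VAR", "")
}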

If that is not possible, another option would be to map your RDD[T] to RDD[(T, String)], where the string is someVar for every record, and then your filter could be something like .filter(x => aFunc(x._1, x._2)). This method will use more memory, as you'll have many copies of someVar's value.
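
A hedged sketch of that second option, written as a standalone helper so it is self-contained (the element type String and the signature of aFunc are assumptions; the question elides both):

import org.apache.spark.rdd.RDD

def filterWithCarriedValue(someRDD: RDD[String], someVar: String,
                           aFunc: (String, String) => Boolean): RDD[String] =
  someRDD
    .map(x => (x, someVar))                // RDD[T] becomes RDD[(T, String)]
    .filter { case (x, v) => aFunc(x, v) } // the suggested filter
    .map(_._1)                             // drop the carried copy afterwards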
