
Using Class Methods in Spark RDD Operations Returns Task not serializable Exception

Suppose I have the following class in Spark Scala:

import org.apache.spark.rdd.RDD

class SparkComputation(i: Int, j: Int) {
  def something(x: Int, y: Int) = (x + y) * i

  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val something = this.something _
    data.map(something(_, j))
  }
}

I get the Task not serializable exception when I run the following code:

val s = new SparkComputation(2, 5)
val data = sc.parallelize(0 to 100)
val res = s.processRDD(data).collect

I'm assuming that the exception occurs because Spark is trying to serialize the SparkComputation instance. To prevent this from happening, I have stored the class members I'm using in the RDD operation in local variables (j and something). However, Spark still tries to serialize the SparkComputation object because of the method. Is there any way to pass the class method to map without forcing Spark to serialize the whole SparkComputation class? I know the following code works without any problem:

def processRDD(data: RDD[Int]) = {
  val j = this.j
  val i = this.i
  data.map(x => (x + j) * i)
}

So, the class members that store values are not causing the problem; the problem is with the function. I have also tried the following approach, with no luck:

class SparkComputation(i: Int, j: Int) {
  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val i = this.i
    def something(x: Int, y: Int) = (x + y) * i
    data.map(something(_, j))
  }
}
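One approach that does avoid capturing the instance is to move the function into an object and pass the needed values explicitly: an object's methods are not tied to any enclosing instance. A sketch under that assumption (names are hypothetical, and Seq stands in for RDD so the example runs without a Spark cluster):

```scala
object Functions {
  // Standalone function: depends only on its arguments, never on `this`.
  def something(i: Int)(x: Int, y: Int): Int = (x + y) * i
}

class Computation(i: Int, j: Int) {
  // Seq stands in for RDD here; the map closure references only locals
  // and the standalone object, never a field of `this`.
  def process(data: Seq[Int]): Seq[Int] = {
    val iLocal = i
    val jLocal = j
    data.map(x => Functions.something(iLocal)(x, jLocal))
  }
}

val result = new Computation(2, 5).process(0 to 3) // 10, 12, 14, 16
```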

Make the class serializable:

class SparkComputation(i: Int, j: Int) extends Serializable {
  def something(x: Int, y: Int) = (x + y) * i

  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val something = this.something _
    data.map(something(_, j))
  }
}
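Extending Serializable works because the closure is then still allowed to capture `this`: the instance itself can now be shipped to the executors. A minimal check outside Spark (names are hypothetical):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// True if obj survives standard Java serialization.
def canSerialize(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }

// With Serializable mixed in, closing over `this` is harmless:
class SerializableComputation(i: Int) extends Serializable {
  def something(x: Int, y: Int): Int = (x + y) * i
}

val f  = new SerializableComputation(2).something _ // still captures the instance...
val ok = canSerialize(f)                            // ...but it can now be serialized
```

Note that this ships a copy of the whole instance with every task, so keep such classes small, or prefer copying fields into local vals as in the question.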

