Generic parser class "Task not serializable"

Question

I'm trying to construct a class that receives a parser as an argument and uses this parser on each line. Below is a minimal example that you can paste into spark-shell.

import scala.util.{Success,Failure,Try}
import scala.reflect.ClassTag

class Reader[T : ClassTag](makeParser: () => (String => Try[T])) {

  def read(): Seq[T] = {

    val rdd = sc.parallelize(Seq("1","2","oops","4")) mapPartitions { lines =>

      // Since making a parser can be expensive, we want to make only one per partition.
      val parser: String => Try[T] = makeParser()

      lines flatMap { line =>
        parser(line) match {
          case Success(record) => Some(record)
          case Failure(_) => None
        }
      }
    }

    rdd.collect()
  }
}

class IntParser extends (String => Try[Int]) with Serializable {
  // There could be an expensive setup operation here...
  def apply(s: String): Try[Int] = Try { s.toInt }
}

However, when I try to run something like new Reader(() => new IntParser).read() (which type-checks just fine), I get the dreaded org.apache.spark.SparkException: Task not serializable error relating to closures.

Why is there an error, and is there a way to re-engineer the above to avoid it (while keeping Reader generic)?

Answer 1

The problem is that makeParser is a member of class Reader: it is a constructor parameter, and because read() uses it, it is stored as a field. Since you use it inside an RDD transformation, the closure refers to this, so Spark tries to serialize the entire Reader instance, which is not serializable. That is why you get the Task not serializable exception.
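The capture described above can be reproduced without Spark, using plain Java serialization (which Spark uses for closures by default). This is only a sketch: CaptureDemo, Holder, capturesThis, capturesLocal and serializes are names made up for the illustration, and it assumes Scala 2.12 or later, where every lambda is serializable as long as everything it captures is.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import scala.util.Try

object CaptureDemo {
  // Hypothetical stand-in for the question's Reader: deliberately NOT Serializable.
  class Holder(makeParser: () => (String => Try[Int])) {
    // The lambda reads the field makeParser, so it captures `this` (the Holder).
    def capturesThis: String => Try[Int] = line => makeParser()(line)

    // The field is copied to a local val first, so the lambda captures only
    // that function value and not the enclosing Holder.
    def capturesLocal: String => Try[Int] = {
      val mk = makeParser
      line => mk()(line)
    }
  }

  // Returns true if Java serialization accepts the object, false if it
  // rejects it with NotSerializableException.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val holder = new Holder(() => (s: String) => Try(s.toInt))
    println(s"closure capturing this serializes:        ${serializes(holder.capturesThis)}")
    println(s"closure capturing a local val serializes: ${serializes(holder.capturesLocal)}")
  }
}
```

Under those assumptions the first closure should be rejected, because serializing it drags in the non-serializable Holder, while the second should be accepted.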

Adding Serializable to class Reader will make your code work. But that is not good practice, since it serializes the whole Reader instance, including fields that the task may not need.
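A common alternative that keeps Reader generic and non-serializable is to copy makeParser into a local val before the transformation, so the closure captures only that function value. The following is a sketch rather than a tested fix: it assumes spark-shell (where sc is the predefined SparkContext), it assumes the function passed as makeParser is itself serializable, and localMakeParser is a name introduced here.

```scala
import scala.util.Try
import scala.reflect.ClassTag

class Reader[T : ClassTag](makeParser: () => (String => Try[T])) {

  def read(): Seq[T] = {
    // Local copy: the closure below refers to this val instead of this.makeParser,
    // so Spark should no longer need to serialize the Reader instance.
    val localMakeParser = makeParser

    val rdd = sc.parallelize(Seq("1", "2", "oops", "4")).mapPartitions { lines =>
      // Still only one parser per partition, as in the original.
      val parser: String => Try[T] = localMakeParser()
      // Keep successfully parsed records, drop failures.
      lines.flatMap(line => parser(line).toOption)
    }

    rdd.collect().toSeq
  }
}
```

With this version, new Reader(() => new IntParser).read() uses the IntParser class from the question unchanged; the parser instances are still created on the executors, inside mapPartitions.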

In general, you can use functions instead of methods to avoid serialization issues, because in Scala functions are objects in their own right and can be serialized by themselves, whereas calling a method requires the instance it belongs to.

Refer to this answer: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
