
Task not serializable error in Spark Scala

I am trying to read a csv file into an RDD in Spark (using Scala). I have written a function to filter the data first so that the header line is not taken into consideration:

def isHeader(line: String): Boolean = {
  line.contains("id_1")
}

and then I am running the following command:

val noheader = rawblocks.filter(x => !isHeader(x))

The rawblocks RDD reads its data from a csv file which is 26 MB in size.

I am getting a Task not serializable error. What could be the solution?

Most probably, you have defined your isHeader method inside a class that is not serializable. As a consequence, isHeader is tied to a non-serializable instance of that class, which is then shipped to the executors via the closure.
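This failure can be reproduced without Spark: Spark serializes the filter closure with Java serialization before shipping it to executors, and a lambda that calls an instance method captures `this`. A minimal sketch of the failing pattern (the `Loader` class here is hypothetical, standing in for whatever class encloses your isHeader):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the class that encloses isHeader.
// Note that it does NOT extend Serializable.
class Loader {
  def isHeader(line: String): Boolean = line.contains("id_1")

  // `x => !isHeader(x)` is really `x => !this.isHeader(x)`,
  // so the lambda captures the entire Loader instance.
  def headerFilter: String => Boolean = x => !isHeader(x)
}

object Repro {
  // Mimics what Spark does to a closure before shipping it.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val f = new Loader().headerFilter
    // Serializing the lambda drags in the non-serializable Loader,
    // which is what Spark surfaces as "Task not serializable".
    println(serializes(f))
  }
}
```

Here `serializes` returns false for the lambda, because the captured Loader instance cannot be written by `ObjectOutputStream` — the same check Spark performs on your task.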

You may want to either define isHeader in a separate object, or make the enclosing class serializable (the latter is not good practice, as you will still ship the entire class instance with your job, which is not intended).
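For example, moving isHeader into a standalone object removes the captured instance: calls to an object's methods compile to static-style calls, so the closure no longer drags an enclosing class along. A minimal, Spark-free sketch (the object name CsvUtils and the sample rows are made up; `filter` works the same way on a real RDD):

```scala
// Hypothetical helper object; nothing here is tied to a class instance,
// so closures referring to it have nothing non-serializable to capture.
object CsvUtils {
  def isHeader(line: String): Boolean = line.contains("id_1")
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the lines of the 26 MB csv file.
    val lines = Vector(
      "id_1,id_2,cmp_fname_c1",
      "37291,53113,0.833",
      "39086,47614,1.0"
    )
    // Same shape as rawblocks.filter(x => !CsvUtils.isHeader(x)) on an RDD.
    val noheader = lines.filter(x => !CsvUtils.isHeader(x))
    noheader.foreach(println)
  }
}
```

With isHeader on an object, the closure passed to `filter` is self-contained and serializes cleanly, so the same expression runs without error on the rawblocks RDD.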
