Why does using a set in filter cause “org.apache.spark.SparkException: Task not serializable”?
I am trying to filter a collection of objects in an RDD, based on whether a field of these objects is in a list.
The approach I am trying is the same as the one described here: Filter based on another RDD in Spark
val namesToFilterOn = sc.textFile("/names_to_filter_on.txt").collect.toSet
val usersRDD = userContext.loadUsers("/user.parquet")
This works:
usersRDD.filter(user => Set("Pete","John" ).contains( user.firstName )).first
When I try
usersRDD.filter(user => namesToFilterOn.contains( user.firstName )).first
I get this error:
org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
I get the same error when I try this:
val shortTestList = Set("Pete","John" )
usersRDD.filter(user => shortTestList.contains( user.firstName )).first
Why do I get this error when specifying a Set of names/Strings in these filter statements?
As far as I can see this should work: I am not specifying the SparkContext anywhere in the filter statements. So why the error? And how can I avoid it?
The version of Spark that I am using is 1.5.2.
I also tried to first broadcast the Set of names.
val namesToFilterOnBC = sc.broadcast(namesToFilterOn)
usersRDD.filter(user => namesToFilterOnBC.value.contains( user.firstName )).first
This leads again to the same error:
org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
The reason is that

val namesToFilterOn = sc.textFile("/names_to_filter_on.txt").collect.toSet

belongs to an object that contains unserializable vals, hence the error.
When

user => namesToFilterOn.contains( user.firstName )

is transformed into a byte format to send to executors over the wire, Spark checks whether there are any references to unserializable objects, and SparkContext is among them.
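To see concretely why referencing a member val drags the whole enclosing instance into the closure, here is a minimal, Spark-free sketch (all class and method names are made up for illustration; the `serializes` helper only mimics the serializability check Spark performs before shipping a closure):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for SparkContext: it carries no Serializable marker.
class UnserializableContext

class UserFilter {
  val ctx = new UnserializableContext      // like keeping a SparkContext around
  val names = Set("Pete", "John")

  // `names` is really `this.names`, so this lambda captures the whole
  // UserFilter instance, including the unserializable ctx field.
  def badPredicate: String => Boolean = s => names.contains(s)

  // Copying the field into a local val first means only the Set is captured.
  def goodPredicate: String => Boolean = {
    val local = names
    s => local.contains(s)
  }
}

// Mimics what Spark does before shipping a closure: try to serialize it.
def serializes(obj: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj); true }
  catch { case _: NotSerializableException => false }
```

With this in place, serializing `badPredicate` fails just like the filter in the question, while `goodPredicate` serializes cleanly.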
It appears that Spark found a place where you reference a non-serializable SparkContext and threw the exception.
A solution is to wrap

val namesToFilterOn = sc.textFile("/names_to_filter_on.txt").collect.toSet

or

val shortTestList = Set("Pete","John" )

as separate methods of an object in Scala. You can also copy the val shortTestList into a local val inside the closure (as described in Job aborted due to stage failure: Task not serializable) or broadcast it.
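The object-wrapping suggestion can be sketched without Spark as well (all names below are illustrative; `canSerialize` is only a stand-in for Spark's closure check). The sets live in a standalone object, and the closure pulls the set out of that object into a local val, so the driver class holding the unserializable context is never captured:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for SparkContext (no Serializable marker).
class FakeSparkContext

// The sets live in a standalone object, as the answer suggests.
object NameSets {
  val shortTestList: Set[String] = Set("Pete", "John")
  // Stand-in for sc.textFile("/names_to_filter_on.txt").collect.toSet
  def namesFrom(lines: Seq[String]): Set[String] = lines.toSet
}

class Driver {
  val sc = new FakeSparkContext            // unserializable, like the real one

  def predicate: String => Boolean = {
    val names = NameSets.shortTestList     // fetched from the object, outside the lambda
    n => names.contains(n)                 // captures only the Set, never `this`
  }
}

def canSerialize(obj: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj); true }
  catch { case _: NotSerializableException => false }
```

Here `predicate` serializes even though `Driver` holds a `FakeSparkContext`, which is exactly what the wrapped-in-an-object workaround buys you.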
You may find the document SIP-21 - Spores quite informative for this case.
I asked the developers of userContext and solved the issue by not explicitly instantiating userContext, but by just importing its functions.
import userContext._
sc.loadUsers("/user.parquet")
usersRDD.filter(user => namesToFilterOn.contains( user.firstName )).first
instead of
val userContext = new UserContext(sc)
userContext.loadUsers("/user.parquet")
usersRDD.filter(user => namesToFilterOn.contains( user.firstName )).first