
Why does using a set in filter cause “org.apache.spark.SparkException: Task not serializable”?

I am trying to filter a collection of objects in an RDD, based on whether a field of those objects appears in a list.

The approach I am trying is the same as the one described here: Filter based on another RDD in Spark

val namesToFilterOn = sc.textFile("/names_to_filter_on.txt").collect.toSet

val usersRDD = userContext.loadUsers("/user.parquet")

This works:

usersRDD.filter(user => Set("Pete", "John").contains(user.firstName)).first

When I try

usersRDD.filter(user => namesToFilterOn.contains(user.firstName)).first

I get this error

org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext

I get the same error when I try this:

val shortTestList = Set("Pete", "John")

usersRDD.filter(user => shortTestList.contains(user.firstName)).first

Why do I get this error when specifying a Set of names/Strings in these filter statements?

As far as I can see, this should work: I am not referencing the SparkContext anywhere in the filter statements. So why the error? And how can I avoid it?

The version of Spark that I am using is 1.5.2.

I also tried to first broadcast the Set of names.

val namesToFilterOnBC = sc.broadcast(namesToFilterOn)
usersRDD.filter(user => namesToFilterOnBC.value.contains(user.firstName)).first

This again leads to the same error:

org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext

The reason is that val namesToFilterOn = sc.textFile("/names_to_filter_on.txt").collect.toSet belongs to an enclosing object that also contains non-serializable vals, and hence the error.

When user => namesToFilterOn.contains( user.firstName ) is serialized into bytes to be sent to executors over the wire, Spark checks whether it references any non-serializable objects, and SparkContext is among them.

It appears that Spark found a place where you reference a non-serializable SparkContext and threw the exception.
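The capture rule can be demonstrated in plain Scala, without Spark. A closure that reads a field of an enclosing class captures `this`, dragging the whole (possibly non-serializable) instance along; a closure that reads a local copy captures only the copy. `Driver` below is a hypothetical stand-in for a class holding a SparkContext-like non-serializable resource:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-side class that, like one holding a
// SparkContext, is NOT Serializable.
class Driver {
  val names: Set[String] = Set("Pete", "John")

  // Reading the field `names` captures `this`, so serializing the closure
  // tries to serialize the whole non-serializable Driver.
  def badFilter: String => Boolean = name => names.contains(name)

  // Copying the field into a local val first means the closure captures
  // only the (serializable) Set itself.
  def goodFilter: String => Boolean = {
    val localNames = names
    name => localNames.contains(name)
  }
}

// Mimics the check Spark performs before shipping a closure to executors.
def isSerializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }
```

Running `isSerializable` on the two closures shows that only the one built from the local copy survives serialization.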

A solution is to wrap val namesToFilterOn = sc.textFile("/names_to_filter_on.txt").collect.toSet or val shortTestList = Set("Pete","John" ) as separate methods of a standalone object in Scala. You can also copy the val into a local variable before using it inside the closure (as described in Job aborted due to stage failure: Task not serializable) or broadcast it.
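A minimal sketch of the object-wrapping approach (the object and method names here are hypothetical). Members of a top-level Scala object are accessed statically, so a closure that uses them captures no enclosing instance at all:

```scala
// Standalone object: its members are reached via a static module reference,
// not via a captured outer `this`, so closures built here serialize cleanly.
object NameFilters {
  val shortTestList: Set[String] = Set("Pete", "John")

  // A predicate suitable for passing to RDD.filter on the driver.
  def keepListed: String => Boolean = name => shortTestList.contains(name)
}
```

With this in place, `usersRDD.filter(user => NameFilters.keepListed(user.firstName))` no longer pulls a non-serializable enclosing class into the closure.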

You may find the document SIP-21 - Spores quite informative for this case.

I asked the developers of userContext and solved the issue by not explicitly instantiating a UserContext but just importing its functions:

import userContext._
sc.loadUsers("/user.parquet")
usersRDD.filter(user => namesToFilterOn.contains( user.firstName )).first

instead of

val userContext = new UserContext(sc)
userContext.loadUsers("/user.parquet")
usersRDD.filter(user => namesToFilterOn.contains( user.firstName )).first
