
Task not serializable with aggregateByKey

Environment: Spark 1.6.0, using Scala. I can compile the program with sbt, but when I submit it, it fails with the error below. The full error is as follows:

17/01/21 18:32:24 INFO net.NetworkTopology: Adding a new node: /YH11070029/10.39.0.213:50010
17/01/21 18:32:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.39.0.44:41961 with 2.7 GB RAM, BlockManagerId(349, 10.39.0.44, 41961)
17/01/21 18:32:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.39.2.178:48591 with 2.7 GB RAM, BlockManagerId(518, 10.39.2.178, 48591)
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:93)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:82)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:82)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1.apply(PairRDDFunctions.scala:177)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1.apply(PairRDDFunctions.scala:166)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.aggregateByKey(PairRDDFunctions.scala:166)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$3.apply(PairRDDFunctions.scala:206)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$3.apply(PairRDDFunctions.scala:206)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.aggregateByKey(PairRDDFunctions.scala:205)
    at com.sina.adalgo.feature.ETL$$anonfun$13.apply(ETL.scala:190)
    at com.sina.adalgo.feature.ETL$$anonfun$13.apply(ETL.scala:102)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

The purpose of the code is to count the frequencies of the categorical features. The main code is as follows:

object ETL extends Serializable {
           ... ...


val cateList = featureData.map {
    case (psid: String, label: String, cate_features: ParArray[String], media_features: String) =>
        val pair_feature = cate_features.zipWithIndex.map(x => (x._2, x._1))
        pair_feature
}.flatMap(_.toList)

def seqop(m: HashMap[String, Int], s: String): HashMap[String, Int] = {
    var x = m.getOrElse(s, 0)
    x += 1
    m += s -> x
    m
}

def combop(m: HashMap[String, Int], n: HashMap[String, Int]): HashMap[String, Int] = {
    for (k <- n) {
        var x = m.getOrElse(k._1, 0)
        x += k._2
        m += k._1 -> x
    }
    m
}

val hash = HashMap[String, Int]()
val feaFreq = cateList.aggregateByKey(hash)(seqop, combop) // RDD[(i, HashMap[String, Int])], where i is the categorical feature index

The object already extends Serializable, so why does this exception occur? Can you help me?

To me, this problem typically happens in Spark when we use an aggregation function (a closure) that unintentionally closes over some unwanted objects, and/or sometimes simply a function that is defined inside the main class of our Spark driver code.

I suspect this might be the case here, since your stack trace shows org.apache.spark.util.ClosureCleaner as the top-level culprit.

This is problematic because, in such a case, when Spark tries to ship that function to the workers so they can do the actual aggregation, it ends up serializing much more than you actually intended: the function itself plus its surrounding class.
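For illustration only, here is a minimal made-up example of that pattern (the names Counter, log, addOne, merge and countByIndex are not from your code); it only shows how calling an instance method inside an RDD closure drags the whole enclosing instance into serialization:

```scala
import org.apache.spark.rdd.RDD

// Illustrative sketch only: Counter and its members are hypothetical.
class Counter {                                      // NOT Serializable
  val log = new java.io.PrintWriter("driver.log")    // non-serializable member

  def addOne(m: Map[String, Int], s: String): Map[String, Int] =
    m.updated(s, m.getOrElse(s, 0) + 1)

  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  def countByIndex(rdd: RDD[(Int, String)]): RDD[(Int, Map[String, Int])] =
    // `addOne` here really means `this.addOne`, so the closure captures `this`:
    // the whole Counter instance, including `log`, has to be serialized and fails.
    rdd.aggregateByKey(Map[String, Int]())(addOne, merge)
}
```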

See also this post by Erik Erlandson, where some corner cases of closure serialization are well explained, as well as the Spark 1.6 notes on closures.

A quick fix is probably to move the definitions of the functions you pass to aggregateByKey into a separate object, completely independent from the rest of the code.
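A minimal sketch of what that could look like for the seqop/combop above (the object name CateFreqFns is made up; same logic, just hosted in its own top-level object):

```scala
import scala.collection.mutable.HashMap

// Hypothetical helper object: nothing here refers back to ETL, so the closures
// Spark serializes carry only this small, serializable object.
object CateFreqFns extends Serializable {
  def seqop(m: HashMap[String, Int], s: String): HashMap[String, Int] = {
    m += s -> (m.getOrElse(s, 0) + 1)
    m
  }

  def combop(m: HashMap[String, Int], n: HashMap[String, Int]): HashMap[String, Int] = {
    for ((k, v) <- n) m += k -> (m.getOrElse(k, 0) + v)
    m
  }
}

// In the driver code:
// val feaFreq = cateList.aggregateByKey(HashMap[String, Int]())(CateFreqFns.seqop, CateFreqFns.combop)
```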
