Scala Spark dataframe: Task not serializable exception even with broadcast variables
This works (`df` is a dataframe):
val filteredRdd = df.rdd.zipWithIndex.collect { case (r, i) if i >= 10 => r }
This doesn't:
val start=10
val filteredRdd = df.rdd.zipWithIndex.collect { case (r, i) if i >= start => r }
I tried using broadcast variables, but even that didn't work:
val start=sc.broadcast(1)
val filteredRdd = df.rdd.zipWithIndex.collect { case (r, i) if i >= start.value => r }
I am getting a Task not serializable exception. Can anyone explain why it fails even with broadcast variables?
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$collect$2.apply(RDD.scala:959)
at org.apache.spark.rdd.RDD$$anonfun$collect$2.apply(RDD.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:958)
at $iwC$$iwC$$iwC... (remaining spark-shell REPL wrapper frames truncated)
The basic constructs you are using look solid. Here is a similar code snippet that does work. Note that it calls `broadcast` and uses the broadcast value inside the `map` method, similarly to your code.
scala> val dat = sc.parallelize(List(1,2,3))
dat: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val br = sc.broadcast(10)
br: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(2)
scala> dat.map(br.value * _)
res2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29
scala> res2.collect
res3: Array[Int] = Array(10, 20, 30)
So this may help you as a verification of your general approach. I suspect your problem was with other variables in your script. Try stripping everything out in a fresh spark-shell session and find the culprit by process of elimination.
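As for why the original code fails even with a broadcast variable: in spark-shell, top-level vals live as fields on a generated wrapper object (the `$iwC` chain in the stack trace), and a closure that reads such a val captures the entire non-serializable wrapper, not just the value. Storing a broadcast handle in such a val does not help, because `start.value` still reads the field through the wrapper. The following Spark-free sketch shows the mechanism; `ReplWrapper` is a hypothetical stand-in for the REPL's wrapper class, and `serializes` performs roughly the check Spark's `ClosureCleaner` runs before shipping a task.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Mimics the spark-shell wrapper object ($iwC) that holds top-level vals.
  // It is deliberately NOT Serializable, like the real REPL wrapper.
  class ReplWrapper {
    val start = 10

    // `start` here is really `this.start`, so this closure captures the
    // whole non-serializable ReplWrapper instance.
    def badPredicate: Long => Boolean = (i: Long) => i >= start

    // Copying the field into a local val first means the closure captures
    // only that Int value, not `this`.
    def goodPredicate: Long => Boolean = {
      val localStart = start
      (i: Long) => i >= localStart
    }
  }

  // Roughly what Spark does to verify a task closure before sending it.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val w = new ReplWrapper
    println(serializes(w.badPredicate))  // false: drags in ReplWrapper
    println(serializes(w.goodPredicate)) // true: captures only the value
  }
}
```

In compiled code, copying the field into a local val before building the closure is the standard workaround. In the REPL, isolating the definitions in a fresh session, as suggested above, is the most reliable way to find which captured value is pulling in a non-serializable object.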