Task not serializable error: Spark

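I have an RDD of the form (String, (Int, Iterable[String])). For every entry in the RDD, the integer value (which I call the distance) is initially set to 10. Every element of the Iterable[String] has its own entry in this RDD, where it serves as the key (so the distance of each element of the Iterable[String] is stored in a separate RDD entry). My intention is to do the following:
1. If the list (Iterable[String]) contains the element "Bethan", assign the entry a distance of 1.
2. After this, create a list of all keys with distance 1 by filtering.
3. After this, transform the RDD into a new RDD in which an entry's distance is updated to 2 if any element of its own list has distance 1.
I have the following code: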

val disOneRdd = disRdd.map(x=> {if(x._2._2.toList.contains("Bethan")) (x._1,(1,x._2._2)) else x})
var lst = disRdd.filter(x=> x._2._1 == 1).keys.collect
val disTwoRdd = disRdd.map(x=> {
    var b:Boolean = false
    loop.breakable{
        for (str <- x._2._2)
        if (lst.contains(str)) //checks if it contains element with distance 1
            b = true
            loop.break
    }
    if (b)
        (x._1,(2,x._2._2))
    else
        (x._1,(10,x._2._2))
    })

However, when I run this I get the error "Task not serializable". How can I make this work, and is there a better way to do it?

Edit

The input RDD has the form:

("abc",(10,List("efg","hij","klm")))
("efg",(10,List("jhg","Beethan","abc","ert")))
("Beethan",(0,List("efg","vcx","zse")))
("vcx",(10,List("czx","Beethan","abc")))
("zse",(10,List("efg","Beethan","nbh")))
("gvf",(10,List("vcsd","fdgd")))
...

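Every element whose list contains Beethan should have distance 1. Every element whose list contains an element with distance 1 (other than Beethan itself) should have distance 2. The output should have the form: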

("abc",(2,List("efg","hij","klm")))
("efg",(1,List("jhg","Beethan","abc","ert")))
("Beethan",(0,List("efg","vcx","zse")))
("vcx",(1,List("czx","Beethan","abc")))
("zse",(1,List("efg","Beethan","nbh")))
("gvf",(10,List("vcsd","fdgd")))
...

Error message:

[error] (run-main-0) org.apache.spark.SparkException: Task not serializable
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.map(RDD.scala:365)
    at Bacon$.main(Bacon.scala:86)
    at Bacon.main(Bacon.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: java.io.NotSerializableException: scala.util.control.Breaks
Serialization stack:
- object not serializable (class: scala.util.control.Breaks, value: scala.util.control.Breaks@78426203)
- field (class: Bacon$$anonfun$15, name: loop$1, type: class  scala.util.control.Breaks)
- object (class Bacon$$anonfun$15, <function1>)

val disOneRdd = disRdd.map(x=> {if(x._2._2.toList.contains("Bethan")) (x._1,(1,x._2._2)) else x})
// filter disOneRdd rather than disRdd: only disOneRdd holds entries with distance 1
var lst = disOneRdd.filter(x=> x._2._1 == 1).keys.collect
val disTwoRdd = disRdd.map(x=> {
    // true if any element of this entry's list has distance 1
    var b:Boolean = x._2._2.filter(y => lst.contains(y)).size > 0
    if (b)
        (x._1,(2,x._2._2))
    else
        (x._1,(10,x._2._2))
    })
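
As a side note on the filter-based check above: a possible alternative is `exists`, which stops at the first matching element instead of building a filtered collection. The sketch below uses plain Scala collections so it runs without Spark; the object and method names (`ExistsDemo`, `hasNeighbourAtDistanceOne`, `inspectedUntilMatch`) are made up for illustration, and the lookup is assumed to be a `Set` rather than the collected array.

```scala
object ExistsDemo {
  // True as soon as one element of `neighbours` is found in `distanceOne`.
  def hasNeighbourAtDistanceOne(neighbours: Iterable[String],
                                distanceOne: Set[String]): Boolean =
    neighbours.exists(n => distanceOne.contains(n))

  // Counts how many elements `exists` looks at before it returns,
  // to illustrate that it does not always traverse the whole list.
  def inspectedUntilMatch(neighbours: Iterable[String],
                          distanceOne: Set[String]): Int = {
    var inspected = 0
    neighbours.exists { n =>
      inspected += 1
      distanceOne.contains(n)
    }
    inspected
  }
}
```

Inside the `map` this would replace the `filter(...).size > 0` expression with a single call, with no `breakable` block involved.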

Or

import scala.util.control.Breaks._
val disOneRdd = disRdd.map(x=> {if(x._2._2.toList.contains("Bethan")) (x._1,(1,x._2._2)) else x})
// filter disOneRdd rather than disRdd: only disOneRdd holds entries with distance 1
var lst = disOneRdd.filter(x=> x._2._1 == 1).keys.collect
val disTwoRdd = disRdd.map(x=> {
    var b:Boolean = false
    breakable{
        for (str <- x._2._2) {
            if (lst.contains(str)) { //checks if it contains element with distance 1
                b = true
                break() // leave the loop at the first match
            }
        }
    }
    if (b)
        (x._1,(2,x._2._2))
    else
        (x._1,(10,x._2._2))
    })

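Both versions work for me. The problem is that loop.breakable cannot be serialized: the serialization stack shows the closure holding a field loop$1 of type scala.util.control.Breaks, and that class is not serializable. To be honest, I don't know whether the behaviour of this construct has changed, but after replacing loop.breakable with breakable it works; perhaps there was some API change. A likely explanation is that the imported breakable and break are called on the Breaks singleton object, so the closure has no Breaks instance to capture. The version with filter may be slower, but it avoids the breakable problem.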
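
To illustrate the difference outside Spark, here is a minimal sketch that uses plain Java serialization. The two function classes are hand-written stand-ins for the compiled closures, not what the compiler actually generates, and all names (`BreaksSerializationDemo`, `isJavaSerializable`, `CapturesLoop`, `UsesBreaksObject`) are hypothetical.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import scala.util.control.Breaks

object BreaksSerializationDemo {
  // Hypothetical helper: true if plain Java serialization accepts `obj`.
  def isJavaSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Stand-in for the failing closure: the Breaks instance becomes a field,
  // so it has to be serialized together with the function object.
  final class CapturesLoop(loop: Breaks)
      extends (Iterable[String] => Boolean) with Serializable {
    def apply(names: Iterable[String]): Boolean = {
      var found = false
      loop.breakable {
        for (name <- names) if (name == "efg") { found = true; loop.break() }
      }
      found
    }
  }

  // Stand-in for the working closure: breakable/break are reached through
  // the Breaks singleton object, so the function object has no Breaks field.
  final class UsesBreaksObject
      extends (Iterable[String] => Boolean) with Serializable {
    def apply(names: Iterable[String]): Boolean = {
      var found = false
      Breaks.breakable {
        for (name <- names) if (name == "efg") { found = true; Breaks.break() }
      }
      found
    }
  }
}
```

Calling `isJavaSerializable` on a `CapturesLoop` built with `new Breaks` should report that it cannot be serialized, while a `UsesBreaksObject` should pass, which matches the behaviour described above.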

Apart from the main problem, lst should be a broadcast variable, but I have not used one here in order to keep the answer as simple as possible.
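
For completeness, a rough sketch of how the broadcast variant might look. It is not the code from the answer: the names `BaconDistances`, `Entry` and `assignDistances` are invented, the target name is passed in as a parameter, and the last step is assumed to run on the result of the first step and to leave distances of 0 and 1 untouched, as in the expected output of the question.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BaconDistances {
  // Hypothetical alias for the entry shape described in the question.
  type Entry = (String, (Int, Iterable[String]))

  def assignDistances(sc: SparkContext, disRdd: RDD[Entry], target: String): RDD[Entry] = {
    // Step 1: entries whose list contains the target get distance 1.
    val disOneRdd = disRdd.map { case (key, (dist, neighbours)) =>
      if (key != target && neighbours.exists(_ == target)) (key, (1, neighbours))
      else (key, (dist, neighbours))
    }

    // Step 2: collect the keys with distance 1 and broadcast them as a Set,
    // so each executor receives one read-only copy with fast lookups.
    val distanceOne =
      sc.broadcast(disOneRdd.filter(_._2._1 == 1).keys.collect().toSet)

    // Step 3: entries still above distance 2 that list a distance-1 key
    // get distance 2; everything else keeps its current distance.
    disOneRdd.map { case (key, (dist, neighbours)) =>
      if (dist > 2 && neighbours.exists(n => distanceOne.value.contains(n)))
        (key, (2, neighbours))
      else (key, (dist, neighbours))
    }
  }
}
```

With the sample input from the question and "Beethan" as the target, this should give abc distance 2, efg, vcx and zse distance 1, and leave Beethan at 0 and gvf at 10.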
