org.apache.spark.SparkException: Task not serializable (Caused by org.apache.hadoop.conf.Configuration)
I want to write the transformed stream to an Elasticsearch index, like this:
transformed.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    val messages = rdd.map(prepare)
    messages.saveAsNewAPIHadoopFile("-", classOf[NullWritable], classOf[MapWritable], classOf[EsOutputFormat], ec)
  }
})
The line val messages = rdd.map(prepare) throws the error below. I am stuck trying to work around it in various ways (e.g. adding @transient next to val conf), but nothing seems to have any effect.
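For background, `@transient` does exclude a field from Java serialization, but the field then comes back as `null` on the executors, so it only helps when the closure never actually dereferences the value. A minimal illustration independent of Spark (the class names here are made up for the demo):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Stand-in for org.apache.hadoop.conf.Configuration (which is not Serializable).
class NotSerializableConf

class Holder extends Serializable {
  // @transient tells Java serialization to skip this field entirely.
  @transient val conf = new NotSerializableConf
}

object TransientDemo {
  def main(args: Array[String]): Unit = {
    // Succeeds only because the non-serializable field is marked @transient;
    // after deserialization the field would be null.
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(new Holder)
    println("serialized ok")
  }
}
```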
16/06/28 19:23:00 ERROR JobScheduler: Error running job streaming job 1467134580000 ms.0
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.map(RDD.scala:323)
    at de.kp.spark.elastic.stream.EsStream$$anonfun$run$1.apply(EsStream.scala:77)
    at de.kp.spark.elastic.stream.EsStream$$anonfun$run$1.apply(EsStream.scala:75)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
Serialization stack:
    - object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml)
    - field (class: de.kp.spark.elastic.stream.EsStream, name: de$kp$spark$elastic$stream$EsStream$$conf, type: class org.apache.hadoop.conf.Configuration)
    - object (class de.kp.spark.elastic.stream.EsStream, de.kp.spark.elastic.stream.EsStream@6b156e9a)
    - field (class: de.kp.spark.elastic.stream.EsStream$$anonfun$run$1, name: $outer, type: class de.kp.spark.elastic.stream.EsStream)
    - object (class de.kp.spark.elastic.stream.EsStream$$anonfun$run$1, )
    - field (class: de.kp.spark.elastic.stream.EsStream$$anonfun$run$1$$anonfun$2, name: $outer, type: class de.kp.spark.elastic.stream.EsStream$$anonfun$run$1)
    - object (class de.kp.spark.elastic.stream.EsStream$$anonfun$run$1$$anonfun$2, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 30 more
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    (followed by the same stack trace and the same NotSerializableException cause, repeated for the main thread)
Does it somehow relate to Hadoop's Configuration? (I am referring to this part of the message: class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml)
Update: here is the complete class:
class EsStream(name:String,conf:HConf) extends SparkBase with Serializable {

  /* Elasticsearch configuration */
  val ec = getEsConf(conf)

  /* Kafka configuration */
  val (kc,topics) = getKafkaConf(conf)

  def run() {

    val ssc = createSSCLocal(name,conf)

    /*
     * The KafkaInputDStream returns a Tuple where only the second component
     * holds the respective message; we therefore reduce to a DStream[String]
     */
    val stream = KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](ssc,kc,topics,StorageLevel.MEMORY_AND_DISK).map(_._2)

    /*
     * Inline transformation of the incoming stream by any function that maps
     * a DStream[String] onto a DStream[String]
     */
    val transformed = transform(stream)

    /*
     * Write transformed stream to Elasticsearch index
     */
    transformed.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val messages = rdd.map(prepare)
        messages.saveAsNewAPIHadoopFile("-", classOf[NullWritable], classOf[MapWritable], classOf[EsOutputFormat], ec)
      }
    })

    ssc.start()
    ssc.awaitTermination()

  }

  def transform(stream:DStream[String]) = stream

  private def getEsConf(config:HConf):HConf = {

    val _conf = new HConf()

    _conf.set("es.nodes", conf.get("es.nodes"))
    _conf.set("es.port", conf.get("es.port"))
    _conf.set("es.resource", conf.get("es.resource"))

    _conf

  }

  private def getKafkaConf(config:HConf):(Map[String,String],Map[String,Int]) = {

    val cfg = Map(
      "group.id" -> conf.get("kafka.group"),
      "zookeeper.connect" -> conf.get("kafka.zklist"),
      "zookeeper.connection.timeout.ms" -> conf.get("kafka.timeout")
    )

    val topics = conf.get("kafka.topics").split(",").map((_,conf.get("kafka.threads").toInt)).toMap

    (cfg,topics)

  }

  private def prepare(message:String):(Object,Object) = {

    val m = JSON.parseFull(message) match {
      case Some(map) => map.asInstanceOf[Map[String,String]]
      case None => Map.empty[String,String]
    }

    val kw = NullWritable.get

    val vw = new MapWritable
    for ((k, v) <- m) vw.put(new Text(k), new Text(v))

    (kw, vw)

  }

}
Remove conf:HConf from the class constructor of EsStream, so that it reads class EsStream(name:String).
Next, create a method with the signature public def init(conf:HConf):Map(String,String); in that method, read the required configuration and use it to update ec and (kc,topics).
After that, call the run method.
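This suggestion works because, after init, the instance only stores plain serializable values rather than the Hadoop Configuration itself, so the closure capturing `$outer` no longer pulls a non-serializable object into the task. A rough sketch of the pattern, again without Spark (the class and field names are illustrative, not from the original project):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Stand-in for org.apache.hadoop.conf.Configuration (which is not Serializable).
class FakeHadoopConf { def get(key: String): String = "dummy-" + key }

// The constructor no longer takes the configuration; init() copies the
// needed settings into serializable fields before run() would be called.
class EsStreamFixed(name: String) extends Serializable {
  private var es: Map[String, String] = Map.empty

  def init(conf: FakeHadoopConf): Unit = {
    es = Map("es.nodes" -> conf.get("es.nodes"), "es.port" -> conf.get("es.port"))
  }

  def prepare(msg: String): String = msg + " -> " + es.getOrElse("es.nodes", "?")

  // Still captures `this`, but every remaining field is serializable now.
  def makeClosure: String => String = m => prepare(m)
}

object FixDemo {
  def main(args: Array[String]): Unit = {
    val stream = new EsStreamFixed("demo")
    stream.init(new FakeHadoopConf)
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(stream.makeClosure)
    println("serialized ok")
  }
}
```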