Task not serializable while using custom dataframe class in Spark Scala
I am facing a strange issue with Scala/Spark (1.5) and Zeppelin:
If I run the following Scala/Spark code, it runs properly:
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map { a =>
  val aa = testList(0)
  None
}
However, after declaring a custom DataFrame type, as suggested here:
//DATAFRAME EXTENSION
import org.apache.spark.sql.DataFrame
object ExtraDataFrameOperations {
  implicit class DFWithExtraOperations(df : DataFrame) {
    //drop several columns
    def drop(colToDrop:Seq[String]):DataFrame = {
      var df_temp = df
      colToDrop.foreach{ case (f: String) =>
        df_temp = df_temp.drop(f)//can be improved with Spark 2.0
      }
      df_temp
    }
  }
}
and using it, for example, like this:
//READ ALL THE FILES INTO different DF and save into map
import ExtraDataFrameOperations._
val filename = "myInput.csv"
val delimiter = ","
val colToIgnore = Seq("c_9", "c_10")
val inputICFfolder = "hdfs:///group/project/TestSpark/"
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "false") // Automatically infer data types? => no cause we need to merge all df, with potential null values => keep string only
  .option("delimiter", delimiter)
  .option("charset", "UTF-8")
  .load(inputICFfolder + filename)
  .drop(colToIgnore)//call the customize dataframe
This runs successfully.
Now, if I run the following code again (the same as above):
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map { a =>
  val aa = testList(0)
  None
}
I get the error message:
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at :32
testList: List[String] = List(a, b)
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
    ...
Caused by: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$
Serialization stack:
    - object not serializable (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$, value: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$@6c7e70e)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: ExtraDataFrameOperations$module, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC@4c6d0802)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
    ...
I don't understand why this happens.
Update:
Trying
@inline val testList = List[String]("a", "b")
does not help.
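It looks like Spark tries to serialize the entire scope surrounding testList. Try inlining the data with @inline val testList = List[String]("a", "b"), or use a separate object to hold the functions/data that you ship to the executors.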
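The capture behaviour described above can be reproduced without Spark. The following is a minimal sketch, not the asker's code: NotebookScope is a hypothetical stand-in for one of the $iwC wrapper objects in the trace, ExtraOps stands in for ExtraDataFrameOperations, and plain Java serialization is used in place of Spark's closure serializer. It assumes the Scala 2 encoding in which a nested object is held in a module field of its enclosing instance (the ExtraDataFrameOperations$module field in the trace).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ScopeCaptureSketch {
  // Hypothetical stand-in for a notebook wrapper ($iwC in the trace).
  class NotebookScope extends Serializable {
    // A plain nested object is not Serializable. Once it has been used,
    // the wrapper instance holds a reference to it in a module field.
    object ExtraOps { def twice(x: Int): Int = x * 2 }
    val warmUp: Int = ExtraOps.twice(1) // forces ExtraOps to be initialised

    val testList: List[String] = List("a", "b")

    // testList is a field, so the closure captures `this`, and serializing
    // the closure drags in the whole wrapper, including ExtraOps.
    def capturesScope: Int => String = _ => testList(0)

    // Copying the field into a local val first means the closure captures
    // only the list, not the wrapper.
    def capturesLocalCopy: Int => String = {
      val local = testList
      _ => local(0)
    }
  }

  // Returns true if plain Java serialization accepts the object.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

In a notebook, top-level vals are fields of the wrapper, so the local copy would have to be made inside a block or a method, for example { val local = testList; rdd.map(_ => local(0)) }. Whether that is enough in a given Zeppelin/Spark version is an assumption that should be checked.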
Simply adding "extends Serializable" worked for me:
// Imports presumably needed by this snippet (not shown in the original answer).
import java.util.concurrent.atomic.AtomicReference
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata}
import org.apache.spark.sql.Dataset
// KafkaProducerHolder is the answerer's own singleton holder; its definition is not shown.

/**
 * A wrapper around ProducerRecord RDD that allows to save RDD to Kafka.
 *
 * KafkaProducer is shared within all threads in one executor.
 * Error handling strategy - remember "last" seen exception and rethrow it to allow task fail.
 */
implicit class DatasetKafkaSink(ds: Dataset[ProducerRecord[String, GenericRecord]]) extends Serializable {
  class ExceptionRegisteringCallback extends Callback {
    private[this] val lastRegisteredException = new AtomicReference[Option[Exception]](None)
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
      Option(exception) match {
        case a @ Some(_) => lastRegisteredException.set(a) // (re)-register exception if send failed
        case _ => // do nothing if encountered successful send
      }
    }
    def rethrowException(): Unit = lastRegisteredException.getAndSet(None).foreach(e => throw e)
  }
  /**
   * Save to Kafka reusing KafkaProducer from singleton holder.
   * Returns back control only once all records were actually sent to Kafka, in case of error rethrows "last" seen
   * exception in the same thread to allow Spark task to fail
   */
  def saveToKafka(kafkaProducerConfigs: Map[String, AnyRef]): Unit = {
    ds.foreachPartition { records =>
      val callback = new ExceptionRegisteringCallback
      val producer = KafkaProducerHolder.getInstance(kafkaProducerConfigs)
      records.foreach(record => producer.send(record, callback))
      producer.flush()
      callback.rethrowException()
    }
  }
}
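A likely reason the marker helps in this snippet: the closure passed to foreachPartition instantiates ExceptionRegisteringCallback, an inner class, so it needs the enclosing DatasetKafkaSink instance and captures it. The sketch below reproduces that pattern without Spark or Kafka. PlainSink, SerializableSink, Callback and task are hypothetical names, and plain Java serialization stands in for Spark's closure serializer.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object InnerClassCaptureSketch {
  // Hypothetical stand-in for the implicit class WITHOUT `extends Serializable`.
  class PlainSink(val name: String) {
    class Callback { def describe: String = s"callback for $name" }
    // `new Callback` needs the enclosing instance, so the closure captures `this`.
    def task: Int => String = _ => new Callback().describe
  }

  // Same shape, but the enclosing class is marked Serializable.
  class SerializableSink(val name: String) extends Serializable {
    class Callback { def describe: String = s"callback for $name" }
    def task: Int => String = _ => new Callback().describe
  }

  // Returns true if plain Java serialization accepts the object.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

The Callback instances are created inside the closure and are never serialized themselves; only the captured enclosing instance has to be. For the question's code, the serialization stack points at the ExtraDataFrameOperations object rather than the implicit class, so the analogous change there would presumably be object ExtraDataFrameOperations extends Serializable; this has not been verified against Zeppelin.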