
Task not serializable while using custom dataframe class in Spark Scala

I am facing a strange issue with Scala / Spark (1.5) and Zeppelin:

If I run the following Scala / Spark code, it runs properly:

// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")

rdd.map{a => 
    val aa = testList(0)
    None}

However, after declaring a custom dataframe type as suggested here:

//DATAFRAME EXTENSION
import org.apache.spark.sql.DataFrame

object ExtraDataFrameOperations {
  implicit class DFWithExtraOperations(df : DataFrame) {

    //drop several columns (see the Spark 2.x sketch below)
    def drop(colToDrop: Seq[String]): DataFrame = {
        var df_temp = df
        colToDrop.foreach { f =>
            df_temp = df_temp.drop(f) // can be improved with Spark 2.0
        }
        df_temp
    }
  }
}
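Side note, not part of the question: the "can be improved with Spark 2.0" comment refers to the multi-column drop available from Spark 2.x onwards. A minimal sketch, assuming a Spark 2.x DataFrame, could look like this:

// Sketch only (Spark 2.x, not the 1.5 used in the question): drop accepts
// several column names as varargs, so the foreach loop is no longer needed.
import org.apache.spark.sql.DataFrame

def dropColumns(df: DataFrame, colToDrop: Seq[String]): DataFrame =
  df.drop(colToDrop: _*)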

and using it as follows:

//READ ALL THE FILES INTO different DF and save into map
import ExtraDataFrameOperations._
val filename = "myInput.csv"

val delimiter =  ","

val colToIgnore = Seq("c_9", "c_10")

val inputICFfolder = "hdfs:///group/project/TestSpark/"

val df = sqlContext.read
            .format("com.databricks.spark.csv")
            .option("header", "true") // Use first line of all files as header
            .option("inferSchema", "false") // Automatically infer data types? => no cause we need to merge all df, with potential null values => keep string only
            .option("delimiter", delimiter)
            .option("charset", "UTF-8")
            .load(inputICFfolder + filename)
            .drop(colToIgnore) // call the custom drop defined above

this runs successfully.

Now, if I run again the following code (same as above):

// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map{a => 
    val aa = testList(0)
    None}

I get the error message:

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:32
testList: List[String] = List(a, b)
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
    ...
Caused by: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$
Serialization stack:
    - object not serializable (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$, value: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$@6c7e70e)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: ExtraDataFrameOperations$module, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC@4c6d0802)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
    ...

I do not understand:

  • Why does this error occur while no operation is performed on the dataframe?
  • Why is "ExtraDataFrameOperations" no longer serializable, although it was used successfully before?

UPDATE:

Trying

@inline val testList = List[String]("a", "b")

does not help.

It looks like Spark tries to serialize the whole scope around testList. Try inlining the data with @inline val testList = List[String]("a", "b") or use a different object where you store the function/data that you pass to the driver.
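A minimal sketch of that suggestion (the object name DataHolder is mine, not from the answer), assuming the same spark-shell / Zeppelin session:

// Hypothetical illustration of the suggestion above: keep the data in a small
// serializable object so the map closure only pulls in that object, not the
// whole REPL scope that also holds ExtraDataFrameOperations.
object DataHolder extends Serializable {
  val testList: List[String] = List("a", "b")
}

val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.map { _ =>
  val aa = DataHolder.testList(0)
  None
}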

Just adding "extends Serializable" worked for me:

// Imports assumed for this snippet (Kafka client, Avro, Spark SQL).
// KafkaProducerHolder is the answerer's own singleton wrapper around KafkaProducer (not shown).
import java.util.concurrent.atomic.AtomicReference

import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata}
import org.apache.spark.sql.Dataset

/**
   * A wrapper around ProducerRecord RDD that allows to save RDD to Kafka.
   *
   * KafkaProducer is shared within all threads in one executor.
   * Error handling strategy - remember "last" seen exception and rethrow it to allow task fail.
   */
 implicit class DatasetKafkaSink(ds: Dataset[ProducerRecord[String, GenericRecord]]) extends Serializable {

   class ExceptionRegisteringCallback extends Callback {
     private[this] val lastRegisteredException = new AtomicReference[Option[Exception]](None)

     override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
       Option(exception) match {
         case a @ Some(_) => lastRegisteredException.set(a) // (re)-register exception if send failed
         case _ => // do nothing if encountered successful send
       }
     }

     def rethrowException(): Unit = lastRegisteredException.getAndSet(None).foreach(e => throw e)
   }

   /**
     * Save to Kafka reusing KafkaProducer from singleton holder.
     * Returns back control only once all records were actually sent to Kafka, in case of error rethrows "last" seen
     * exception in the same thread to allow Spark task to fail
     */
   def saveToKafka(kafkaProducerConfigs: Map[String, AnyRef]): Unit = {
     ds.foreachPartition { records =>
       val callback = new ExceptionRegisteringCallback
       val producer = KafkaProducerHolder.getInstance(kafkaProducerConfigs)

       records.foreach(record => producer.send(record, callback))

       producer.flush()
       callback.rethrowException()
     }
   }
 }
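For comparison, a minimal sketch of the same "extends Serializable" fix applied to the helper from the question (Spark 1.5, as above) could look like this:

import org.apache.spark.sql.DataFrame

// Sketch only: the question's helper with "extends Serializable" added, so the
// ExtraDataFrameOperations module captured by the REPL closure can be serialized.
object ExtraDataFrameOperations extends Serializable {
  implicit class DFWithExtraOperations(df: DataFrame) extends Serializable {

    // drop several columns, one drop call per column
    def drop(colToDrop: Seq[String]): DataFrame = {
      var dfTemp = df
      colToDrop.foreach { f =>
        dfTemp = dfTemp.drop(f)
      }
      dfTemp
    }
  }
}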
