How to handle exceptions in Spark and Scala
I am trying to handle common exceptions in Spark, for example a .map operation not working correctly on all elements of the data, or a FileNotFound exception. I have read all the existing questions and the following two posts:
https://www.nicolaferraro.me/2016/02/18/exception-handling-in-apache-spark
I tried a Try statement on the line

attributes => mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble

so that it reads

attributes => Try(mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble))

but that doesn't compile; the compiler no longer recognises the .toDF() statement afterwards. I have also tried a Java-like try { } catch { } block but cannot get the scoping right; df is then not returned. Does anyone know how to do this properly? Do I even need to handle these exceptions, since the Spark framework already seems to handle a FileNotFound exception without me adding one? But I would like to generate an error with the number of fields in the schema if the input file has the wrong number of columns.
Here is the code:
object DataLoadTest extends SparkSessionWrapper {

  /** Helper function to create a DataFrame from a textfile, re-used in subsequent tests */
  def createDataFrame(fileName: String): DataFrame = {
    import spark.implicits._
    //try {
    val df = spark.sparkContext
      .textFile("/path/to/file" + fileName)
      .map(_.split("\\t"))
      //mHealth user is the case class which defines the data schema
      .map(attributes => mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble,
        attributes(3).toDouble, attributes(4).toDouble,
        attributes(5).toDouble, attributes(6).toDouble, attributes(7).toDouble,
        attributes(8).toDouble, attributes(9).toDouble, attributes(10).toDouble,
        attributes(11).toDouble, attributes(12).toDouble, attributes(13).toDouble,
        attributes(14).toDouble, attributes(15).toDouble, attributes(16).toDouble,
        attributes(17).toDouble, attributes(18).toDouble, attributes(19).toDouble,
        attributes(20).toDouble, attributes(21).toDouble, attributes(22).toDouble,
        attributes(23).toInt))
      .toDF()
      .cache()
    df
  } catch {
    case ex: FileNotFoundException => println(s"File $fileName not found")
    case unknown: Exception => println(s"Unknown exception: $unknown")
  }
}
All suggestions appreciated. Thanks!
Another option is to use the Try type in Scala. For example:
import scala.util.{Try, Success, Failure}

def createDataFrame(fileName: String): Try[DataFrame] = {
  try {
    //create dataframe df
    Success(df)
  } catch {
    case ex: FileNotFoundException =>
      println(s"File $fileName not found")
      Failure(ex)
    case unknown: Exception =>
      println(s"Unknown exception: $unknown")
      Failure(unknown)
  }
}
Now, on the caller side, handle it like:
createDataFrame("file1.csv") match {
  case Success(df) => {
    // proceed with your pipeline
  }
  case Failure(ex) => //handle exception
}
This is slightly better than using Option, because the caller learns the reason for the failure and can handle it accordingly.
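The asker's original attempt (wrapping the constructor in Try inside the .map) fails to compile because the result becomes an RDD of Try[mHealthUser], which .toDF() cannot encode. A hedged plain-Scala sketch of the usual per-row fix: apply Try to each row, then flatMap the failures away before converting. Reading and its fields are hypothetical stand-ins for the 24-field mHealthUser:

```scala
import scala.util.Try

// Hypothetical 3-field stand-in for the 24-field mHealthUser case class.
case class Reading(accX: Double, accY: Double, label: Int)

// Wrap the per-row parse in Try so malformed rows (wrong field count,
// non-numeric values) become Failure instead of killing the job.
def parseLine(line: String): Try[Reading] = Try {
  val a = line.split("\t")
  Reading(a(0).toDouble, a(1).toDouble, a(2).toInt)
}

val lines = Seq("1.0\t2.0\t3", "not\tnumeric\tdata", "4.5\t6.7\t0")

// Drop the failures and keep the parsed rows; on an RDD the same shape is
// rdd.map(parseLine).flatMap(_.toOption).toDF()
val good = lines.map(parseLine).flatMap(_.toOption)
```

Because the Failure rows are filtered out before .toDF(), the compiler only ever sees an RDD of the case class, which is what the implicit encoder expects.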
Either let the exception be thrown out of the createDataFrame method (and handle it outside), or change the signature to return Option[DataFrame]:
import java.io.FileNotFoundException

def createDataFrame(fileName: String): Option[DataFrame] = {
  import spark.implicits._
  try {
    val df = spark.sparkContext
      .textFile("/path/to/file" + fileName)
      .map(_.split("\\t"))
      //mHealth user is the case class which defines the data schema
      .map(attributes => mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble,
        attributes(3).toDouble, attributes(4).toDouble,
        attributes(5).toDouble, attributes(6).toDouble, attributes(7).toDouble,
        attributes(8).toDouble, attributes(9).toDouble, attributes(10).toDouble,
        attributes(11).toDouble, attributes(12).toDouble, attributes(13).toDouble,
        attributes(14).toDouble, attributes(15).toDouble, attributes(16).toDouble,
        attributes(17).toDouble, attributes(18).toDouble, attributes(19).toDouble,
        attributes(20).toDouble, attributes(21).toDouble, attributes(22).toDouble,
        attributes(23).toInt))
      .toDF()
      .cache()
    Some(df)
  } catch {
    case ex: FileNotFoundException =>
      println(s"File $fileName not found")
      None
    case unknown: Exception =>
      println(s"Unknown exception: $unknown")
      None
  }
}
Edit: there are several patterns on the caller side of createDataFrame. If you are processing several file names you can, for example, do:
val dfs : Seq[DataFrame] = Seq("file1","file2","file3").map(createDataFrame).flatten
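The .flatten works because each createDataFrame call returns an Option[DataFrame], so mapping then flattening keeps only the frames that were created successfully and silently drops the Nones. The same shape in plain Scala, with parseIntOpt as a hypothetical stand-in for createDataFrame:

```scala
// Hypothetical stand-in for createDataFrame: Some on success, None on failure.
def parseIntOpt(s: String): Option[Int] =
  scala.util.Try(s.toInt).toOption

// map + flatten drops the Nones, mirroring
// Seq("file1","file2","file3").map(createDataFrame).flatten
val nums = Seq("1", "oops", "3").map(parseIntOpt).flatten
```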
If you are working with a single file name, you can do:
createDataFrame("file1.csv") match {
  case Some(df) => {
    // proceed with your pipeline
    val df2 = df.filter($"activityLabel" > 0).withColumn("binaryLabel", when($"activityLabel".between(1, 3), 0).otherwise(1))
  }
  case None => println("could not create dataframe")
}
Applying a try and catch block on dataframe columns (note that a Spark Column expression is evaluated lazily, so a plain try/catch around it only catches exceptions thrown while the expression itself is constructed, not per-row data errors):
(try{$"credit.amount"} catch{case e:Exception=> lit(0)}).as("credit_amount")
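Because of that laziness, a more reliable way to get the "column value if present, else a literal default" behaviour (a sketch, assuming the same df and column name) is to check the schema first, e.g. if (df.columns.contains("credit.amount")) col("credit.amount").as("credit_amount") else lit(0).as("credit_amount"). The same shape in plain, testable Scala:

```scala
// Plain-Scala analogue of "use the column if present, else a default";
// the Map stands in for one row keyed by column name (hypothetical).
def amountOrDefault(row: Map[String, Double], name: String, default: Double): Double =
  row.getOrElse(name, default)
```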