How to handle exceptions in Spark and Scala
I am trying to handle common exceptions in Spark, for example a .map operation not working correctly on all elements of the data, or a FileNotFound exception. I have read all the existing questions and the following two posts:
https://www.nicolaferraro.me/2016/02/18/exception-handling-in-apache-spark
I tried a Try statement on the line

attributes => mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble

so that it reads

attributes => Try(mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble))

but that doesn't compile; the compiler no longer recognises the .toDF() statement afterwards. I have also tried a Java-like try { } catch { } block but cannot get the scoping right; df is then not returned. Does anyone know how to do this properly? Do I even need to handle these exceptions, since the Spark framework already seems to handle a FileNotFound exception without me adding one? But I would like to generate an error with the number of fields in the schema if the input file has the wrong number of columns.
Here is the code:
object DataLoadTest extends SparkSessionWrapper {

  /** Helper function to create a DataFrame from a textfile, re-used in subsequent tests */
  def createDataFrame(fileName: String): DataFrame = {
    import spark.implicits._
    //try {
    val df = spark.sparkContext
      .textFile("/path/to/file" + fileName)
      .map(_.split("\\t"))
      //mHealth user is the case class which defines the data schema
      .map(attributes => mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble,
        attributes(3).toDouble, attributes(4).toDouble,
        attributes(5).toDouble, attributes(6).toDouble, attributes(7).toDouble,
        attributes(8).toDouble, attributes(9).toDouble, attributes(10).toDouble,
        attributes(11).toDouble, attributes(12).toDouble, attributes(13).toDouble,
        attributes(14).toDouble, attributes(15).toDouble, attributes(16).toDouble,
        attributes(17).toDouble, attributes(18).toDouble, attributes(19).toDouble,
        attributes(20).toDouble, attributes(21).toDouble, attributes(22).toDouble,
        attributes(23).toInt))
      .toDF()
      .cache()
    df
  } catch {
    case ex: FileNotFoundException => println(s"File $fileName not found")
    case unknown: Exception => println(s"Unknown exception: $unknown")
  }
}
All suggestions appreciated. Thanks!
Another option is to use the Try type in Scala. For example:
import scala.util.{Try, Success, Failure}

def createDataFrame(fileName: String): Try[DataFrame] = {
  try {
    //create dataframe df
    Success(df)
  } catch {
    case ex: FileNotFoundException =>
      println(s"File $fileName not found")
      Failure(ex)
    case unknown: Exception =>
      println(s"Unknown exception: $unknown")
      Failure(unknown)
  }
}
Now, on the caller side, handle it like:
createDataFrame("file1.csv") match {
  case Success(df) => {
    // proceed with your pipeline
  }
  case Failure(ex) => //handle exception
}
This is slightly better than using Option, because the caller learns the reason for the failure and can handle it accordingly.
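The asker's original attempt (wrapping the constructor in Try inside the .map) fails to compile because the result becomes an RDD of Try[mHealthUser], which .toDF() cannot encode. A hedged plain-Scala sketch of the usual per-row fix: apply Try to each row, then flatMap the failures away before converting. Reading and its fields are hypothetical stand-ins for the 24-field mHealthUser:

```scala
import scala.util.Try

// Hypothetical 3-field stand-in for the 24-field mHealthUser case class.
case class Reading(accX: Double, accY: Double, label: Int)

// Wrap the per-row parse in Try so malformed rows (wrong field count,
// non-numeric values) become Failure instead of killing the job.
def parseLine(line: String): Try[Reading] = Try {
  val a = line.split("\t")
  Reading(a(0).toDouble, a(1).toDouble, a(2).toInt)
}

val lines = Seq("1.0\t2.0\t3", "not\tnumeric\tdata", "4.5\t6.7\t0")

// Drop the failures and keep the parsed rows; on an RDD the same shape is
// rdd.map(parseLine).flatMap(_.toOption).toDF()
val good = lines.map(parseLine).flatMap(_.toOption)
```

Because the Failure rows are filtered out before .toDF(), the compiler only ever sees an RDD of the case class, which is what the implicit encoder expects.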
Either let the exception be thrown out of the createDataFrame method (and handle it outside), or change the signature to return Option[DataFrame]:
import java.io.FileNotFoundException

def createDataFrame(fileName: String): Option[DataFrame] = {
  import spark.implicits._
  try {
    val df = spark.sparkContext
      .textFile("/path/to/file" + fileName)
      .map(_.split("\\t"))
      //mHealth user is the case class which defines the data schema
      .map(attributes => mHealthUser(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble,
        attributes(3).toDouble, attributes(4).toDouble,
        attributes(5).toDouble, attributes(6).toDouble, attributes(7).toDouble,
        attributes(8).toDouble, attributes(9).toDouble, attributes(10).toDouble,
        attributes(11).toDouble, attributes(12).toDouble, attributes(13).toDouble,
        attributes(14).toDouble, attributes(15).toDouble, attributes(16).toDouble,
        attributes(17).toDouble, attributes(18).toDouble, attributes(19).toDouble,
        attributes(20).toDouble, attributes(21).toDouble, attributes(22).toDouble,
        attributes(23).toInt))
      .toDF()
      .cache()
    Some(df)
  } catch {
    case ex: FileNotFoundException =>
      println(s"File $fileName not found")
      None
    case unknown: Exception =>
      println(s"Unknown exception: $unknown")
      None
  }
}
Edit: there are several patterns on the caller side of createDataFrame. If you are processing several file names you can, for example, do:
val dfs : Seq[DataFrame] = Seq("file1","file2","file3").map(createDataFrame).flatten
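The .flatten works because each createDataFrame call returns an Option[DataFrame], so mapping then flattening keeps only the frames that were created successfully and silently drops the Nones. The same shape in plain Scala, with parseIntOpt as a hypothetical stand-in for createDataFrame:

```scala
// Hypothetical stand-in for createDataFrame: Some on success, None on failure.
def parseIntOpt(s: String): Option[Int] =
  scala.util.Try(s.toInt).toOption

// map + flatten drops the Nones, mirroring
// Seq("file1","file2","file3").map(createDataFrame).flatten
val nums = Seq("1", "oops", "3").map(parseIntOpt).flatten
```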
If you are working with a single file name, you can do:
createDataFrame("file1.csv") match {
  case Some(df) => {
    // proceed with your pipeline
    val df2 = df.filter($"activityLabel" > 0).withColumn("binaryLabel", when($"activityLabel".between(1, 3), 0).otherwise(1))
  }
  case None => println("could not create dataframe")
}
Applying a try and catch block on dataframe columns (note that a Spark Column expression is evaluated lazily, so a plain try/catch around it only catches exceptions thrown while the expression itself is constructed, not per-row data errors):
(try{$"credit.amount"} catch{case e:Exception=> lit(0)}).as("credit_amount")
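Because of that laziness, a more reliable way to get the "column value if present, else a literal default" behaviour (a sketch, assuming the same df and column name) is to check the schema first, e.g. if (df.columns.contains("credit.amount")) col("credit.amount").as("credit_amount") else lit(0).as("credit_amount"). The same shape in plain, testable Scala:

```scala
// Plain-Scala analogue of "use the column if present, else a default";
// the Map stands in for one row keyed by column name (hypothetical).
def amountOrDefault(row: Map[String, Double], name: String, default: Double): Double =
  row.getOrElse(name, default)
```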