
Spark mapping string to java.sql.Timestamp yields nondeterministic exceptions

As part of learning Spark, I am trying to analyse issues from our internal defect tracking system. Here is a sample of the CSV data I am working with:

#;Projekt;Temat;Status;Typ zagadnienia;Miejsce wystąpienia;Obszar;Data utworzenia;Data zamknięcia;Przepracowany czas
10317;CENTRALA;some random topic;INBOX;Wsparcie;some place;some area;2016-02-22 13:33;;0,5
10315;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-22 13:28;2016-02-22 17:52;0,5
10313;CENTRALA;some random topic;Weryfikacja;Utrudnione działanie systemu;some place;some area;2016-02-22 12:39;;0,75
10311;CENTRALA;some random topic;Przypisany;Wsparcie;some place;some area;2016-02-22 11:57;;0
10309;CENTRALA;some random topic;INBOX;Wsparcie;some place;some area;2016-02-22 11:50;;0,83
10307;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-22 11:35;2016-02-22 13:18;0,42
10305;CENTRALA;some random topic;Przypisany;Usterka części systemu;some place;some area;2016-02-22 10:47;;0
10303;CENTRALA;some random topic;Nowy;Wsparcie;some place;some area;2016-02-22 10:39;;0
10301;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-20 11:30;2016-02-22 15:53;0,25
10297;CENTRALA;some random topic;INBOX;Utrudnione działanie systemu;some place;some area;2016-02-19 15:52;;0
10295;CENTRALA;some random topic;Przypisany;Utrudnione działanie systemu;some place;some area;2016-02-19 15:51;;0
10293;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-19 14:25;2016-02-19 14:25;0,25
10291;CENTRALA;some random topic;Przypisany;Wsparcie;some place;some area;2016-02-19 14:24;;0
10289;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-19 12:03;2016-02-19 15:00;1
10287;CENTRALA;some random topic;Weryfikacja;Wsparcie;some place;some area;2016-02-19 10:12;;0,33
10285;CENTRALA;some random topic;Dostępny na PRD;Usterka części systemu;some place;some area;2016-02-19 08:00;;1,5
10283;CENTRALA;some random topic;Nowy;Wsparcie;some place;some area;2016-02-18 18:56;;0
10281;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-18 16:59;2016-02-22 15:52;0,25
10279;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-18 16:33;2016-02-18 16:33;0,33
10277;CENTRALA;some random topic;Rozwiązany;Wsparcie;some place;some area;2016-02-18 16:04;2016-02-22 15:45;0,25

With the following Scala code:

  import scala.util.Try
  import java.text.SimpleDateFormat
  import java.sql.Timestamp
  import org.apache.spark.sql._
  import org.apache.spark.sql.functions._

  val issuesCSV = sc.textFile("""./issues_3.csv""")

  case class Issue(id: Int, 
                 project: String, 
                 topic: String, 
                 status: String,
                 issue_type: String, 
                 location: List[String],
                 area: List[String],
                 opened: Timestamp, 
                 closed: Option[Timestamp], 
                 spent_time: Float)

  val formatter = new SimpleDateFormat("yyyy-MM-dd hh:mm");

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  val issues = issuesCSV
             .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
             .map(_.split(";"))
             .map(i => Issue(
           i(0).toInt, 
           i(1), 
           i(2), 
           i(3), 
           i(4), 
           i(5).split(',').toList, 
           i(6).split(',').toList, 
           new java.sql.Timestamp(formatter.parse(i(7)).getTime), 
           Try(new java.sql.Timestamp(formatter.parse(i(8)).getTime)).toOption, 
           i(9).replace(',','.').toFloat)).toDF()

  issues.printSchema()
  issues.count()
  issues.map(i => i(7)).collect()

Now, my problem is that the last line behaves nondeterministically: sometimes it gives the expected result, but more often it throws an exception such as the one below:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 60.0 failed 1 times, most recent failure: Lost task 0.0 in stage 60.0 (TID 112, localhost): java.lang.NumberFormatException: multiple points
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1890)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.text.DigitList.getDouble(DigitList.java:169)
    at java.text.DecimalFormat.parse(DecimalFormat.java:2056)
    at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1869)
    at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
    at java.text.DateFormat.parse(DateFormat.java:364)
    at $anonfun$2.apply(<console>:90)

The error always has to do with date parsing, although it is a different one each time: either "multiple points", For input string: ".2216", or For input string: "" (none of which actually appear in the file). I suspect it has something to do with constructing the case class objects in a multithreaded environment, but I really don't know what could be going wrong. As for the environment, I am running Spark Notebook locally with Scala [2.11.7], Spark [1.6.0], and Hadoop [2.7.1].

SimpleDateFormat is not thread-safe, which is what causes the intermittent exceptions. You need to either use a thread-safe type (for example Joda's DateTimeFormat) or confine its usage to a single thread.
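
For illustration, here is a minimal sketch of the "single thread" option, reusing the Issue case class and issuesCSV RDD from the question, so only the parsing part changes. The formatter is constructed inside mapPartitions, so each task thread gets its own instance and no mutable parser state is shared through the driver-side closure. The pattern also uses HH rather than hh, matching the 24-hour times in the sample file.

  val issues = issuesCSV
    .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
    .mapPartitions { rows =>
      // Created once per partition, so it is only ever used by the single
      // thread running this task; SimpleDateFormat's mutable state is never shared.
      val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm")
      def ts(s: String) = new java.sql.Timestamp(fmt.parse(s).getTime)
      rows.map(_.split(";")).map(i => Issue(
        i(0).toInt,
        i(1),
        i(2),
        i(3),
        i(4),
        i(5).split(',').toList,
        i(6).split(',').toList,
        ts(i(7)),
        Try(ts(i(8))).toOption,
        i(9).replace(',', '.').toFloat))
    }.toDF()

The same structure works with a thread-safe formatter such as Joda's DateTimeFormat.forPattern, as suggested above; building the formatter inside mapPartitions simply sidesteps the thread-safety question altogether.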

