
How to deal with multiple date formats? Spark - Scala

I have data in JSON format like this:

....
{"Title":"51 Birch Street","US_Gross":84689,"Worldwide_Gross":84689,"US_DVD_Sales":null,"Production_Budget":350000,"Release_Date":"18-Oct-06","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Truly Indie","Source":null,"Major_Genre":null,"Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":97,"IMDB_Rating":7.4,"IMDB_Votes":439}
{"Title":"55 Days at Peking","US_Gross":10000000,"Worldwide_Gross":10000000,"US_DVD_Sales":null,"Production_Budget":17000000,"Release_Date":"1963-01-01","MPAA_Rating":null,"Running_Time_min":null,"Distributor":null,"Source":"Original Screenplay","Major_Genre":"Drama","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":57,"IMDB_Rating":6.8,"IMDB_Votes":2104}
{"Title":"Nine 1/2 Weeks","US_Gross":6734844,"Worldwide_Gross":6734844,"US_DVD_Sales":null,"Production_Budget":18000000,"Release_Date":"21-Feb-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Drama","Creative_Type":"Contemporary Fiction","Director":"Adrian Lyne","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.4,"IMDB_Votes":12731}
{"Title":"AstÈrix aux Jeux Olympiques","US_Gross":999811,"Worldwide_Gross":132999811,"US_DVD_Sales":null,"Production_Budget":113500000,"Release_Date":"4-Jul-08","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Alliance","Source":"Based on Comic/Graphic Novel","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":4.9,"IMDB_Votes":5620}
{"Title":"The Abyss","US_Gross":54243125,"Worldwide_Gross":54243125,"US_DVD_Sales":null,"Production_Budget":70000000,"Release_Date":"9-Aug-89","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"20th Century Fox","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Science Fiction","Director":"James Cameron","Rotten_Tomatoes_Rating":88,"IMDB_Rating":7.6,"IMDB_Votes":51018}
{"Title":"Action Jackson","US_Gross":20257000,"Worldwide_Gross":20257000,"US_DVD_Sales":null,"Production_Budget":7000000,"Release_Date":"12-Feb-88","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Lorimar Motion Pictures","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":10,"IMDB_Rating":4.6,"IMDB_Votes":3856}
{"Title":"Ace Ventura: Pet Detective","US_Gross":72217396,"Worldwide_Gross":107217396,"US_DVD_Sales":null,"Production_Budget":12000000,"Release_Date":"4-Feb-94","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Tom Shadyac","Rotten_Tomatoes_Rating":49,"IMDB_Rating":6.6,"IMDB_Votes":63543}
{"Title":"Ace Ventura: When Nature Calls","US_Gross":108360063,"Worldwide_Gross":212400000,"US_DVD_Sales":null,"Production_Budget":30000000,"Release_Date":"10-Nov-95","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Steve Oedekerk","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.6,"IMDB_Votes":51275}
{"Title":"April Fool's Day","US_Gross":12947763,"Worldwide_Gross":12947763,"US_DVD_Sales":null,"Production_Budget":5000000,"Release_Date":"27-Mar-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Paramount Pictures","Source":"Original Screenplay","Major_Genre":"Horror","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":31,"IMDB_Rating":null,"IMDB_Votes":null}
{"Title":"Among Giants","US_Gross":64359,"Worldwide_Gross":64359,"US_DVD_Sales":null,"Production_Budget":4000000,"Release_Date":"26-Mar-99","MPAA_Rating":"R","Running_Time_min":null,"Distributor":"Fox Searchlight","Source":"Original Screenplay","Major_Genre":"Romantic Comedy","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.7,"IMDB_Votes":546}
{"Title":"Annie Get Your Gun","US_Gross":8000000,"Worldwide_Gross":8000000,"US_DVD_Sales":null,"Production_Budget":3768785,"Release_Date":"17-May-50","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":100,"IMDB_Rating":7.1,"IMDB_Votes":1326}
{"Title":"Alice in Wonderland","US_Gross":0,"Worldwide_Gross":0,"US_DVD_Sales":null,"Production_Budget":3000000,"Release_Date":"28-Jul-51","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"RKO Radio Pictures","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":20,"IMDB_Rating":6.7,"IMDB_Votes":63458}
{"Title":"The Princess and the Cobbler","US_Gross":669276,"Worldwide_Gross":669276,"US_DVD_Sales":null,"Production_Budget":24000000,"Release_Date":"25-Aug-95","MPAA_Rating":"G","Running_Time_min":null,"Distributor":"Miramax","Source":"Original Screenplay","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":7.3,"IMDB_Votes":893}
....

The "Release_Date" field contains several date formats, for example 26-Mar-99, 1963-01-01 and 4-Jul-08.

I have some code that works:

      import org.apache.spark.sql.functions.{col, to_date}

      val moviesDF = spark.read
        .option("inferSchema", "true")
        .json(s"${path}/movies.json")

      moviesDF.show(truncate = false)

      // Parse Release_Date using a single format
      val moviesWithReleaseDates = moviesDF
        .select(col("Title"), to_date(col("Release_Date"), "dd-MMM-yy").as("Actual_Release"))
      moviesWithReleaseDates.show(truncate = false)

But the output is:

|Four Rooms                                |1995-12-25    |
|The Four Seasons                          |1981-05-22    |
|Four Weddings and a Funeral               |1994-03-09    |
|51 Birch Street                           |2006-10-18    |
|55 Days at Peking                         |null          |
|Nine 1/2 Weeks                            |1986-02-21    |
|AstÈrix aux Jeux Olympiques               |2008-07-04    |
|The Abyss                                 |1989-08-09    |
|Action Jackson                            |1988-02-12    |
|Ace Ventura: Pet Detective                |1994-02-04    |

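When the date looks like "18-Oct-06" it works fine, but when the date is in a different format the result is null.

How can I show all the dates, with no nulls, in a simple and elegant way?

Thanks in advance.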

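This happens because in to_date(col("Release_Date"), "dd-MMM-yy") you specify the input date format. If a date in the JSON matches that format, it is parsed correctly. If not, the result is null.

So you need to be able to read the date text from the JSON in every format it may appear in.

Write a UDF. Pass the date text to it as input. Inside the UDF, try each possible date format and return the parsed date object for the first one that matches. A UDF is definitely helpful here.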
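As a rough illustration of the "try each format in turn" idea, here is a minimal sketch of the parsing function such a UDF could wrap. It uses java.time rather than any Spark API so that it runs on its own. The object name ReleaseDateParser and the choice to resolve two-digit years into 1930-2029 are assumptions made for this example; pick a pivot that suits your data.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.time.temporal.ChronoField
import java.util.Locale
import scala.util.{Success, Try}

// Hypothetical helper, not part of Spark: tries a fixed list of formats in order.
object ReleaseDateParser {

  // Matches dates such as "18-Oct-06" or "4-Jul-08".
  // ASSUMPTION: two-digit years are resolved into the range 1930-2029.
  private val dayMonthShortYear: DateTimeFormatter =
    new DateTimeFormatterBuilder()
      .parseCaseInsensitive()
      .appendPattern("d-MMM-")
      .appendValueReduced(ChronoField.YEAR, 2, 2, 1930)
      .toFormatter(Locale.ENGLISH)

  // Formats are tried in this order; ISO_LOCAL_DATE matches "1963-01-01".
  private val formatters: Seq[DateTimeFormatter] =
    Seq(dayMonthShortYear, DateTimeFormatter.ISO_LOCAL_DATE)

  // Returns the first successful parse, or None for null or unmatched input.
  def parse(text: String): Option[LocalDate] =
    Option(text).map(_.trim).flatMap { s =>
      formatters.iterator
        .map(f => Try(LocalDate.parse(s, f)))
        .collectFirst { case Success(date) => date }
    }
}
```

To use it from Spark, one option would be to wrap it along the lines of udf((s: String) => ReleaseDateParser.parse(s).map(java.sql.Date.valueOf)) and apply the result to col("Release_Date"). Rows that match none of the formats would then come out as null.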

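In any case, you need a finite list of the date formats that Release_Date can take in the file, or that you want your processing to support.

You can write a UDF that parses the date string along these lines: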

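    import java.text.SimpleDateFormat
    import scala.util.{Success, Try}

    // Formats to try, in order; append any others you need to support
    val formatStrings = Seq("dd-MMM-yy", "yyyy-MM-dd")

    // parse throws on a mismatch instead of returning null, so wrap each attempt in Try.
    // The result is a java.sql.Date so that it can be returned from a Spark UDF.
    def tryParse(dateString: String): Option[java.sql.Date] =
      formatStrings.iterator
        .map(fmt => Try(new SimpleDateFormat(fmt).parse(dateString)))
        .collectFirst { case Success(d) => new java.sql.Date(d.getTime) }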

Or use coalesce, which returns the first non-null result:

    coalesce(
      to_date(col("Release_Date"), "dd-MMM-yy"),
      to_date(col("Release_Date"), "yyyy-MM-dd")
      // add one more to_date(...) call for each further format you need
    ).as("Actual_Release")

Or:

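    import spark.implicits._ // for the $"..." column syntax

    // Formats are tried in order; "dd-MMM-yy" is listed first so that two-digit years such as "18-Oct-06" match it
    val dt_formats = Seq("dd-MMM-yy", "dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")

    val newDF = df.withColumn("Actual_Release", coalesce(dt_formats.map(fmt => to_date($"Release_Date", fmt)): _*))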

You can try something like this. I don't know whether it is elegant, but it is simple:

val mWRD = moviesDF.selectExpr("""Title""",
"""IF(LENGTH(Release_Date) <= 9,to_date(Release_Date,'dd-MMM-yy'),
to_date(Release_Date,'yyyy-MM-dd')) AS Actual_Release""")
mWRD.show(truncate = false)

