How to deal with multiple date formats? Spark - Scala
I have data in JSON format like this:
....
{"Title":"51 Birch Street","US_Gross":84689,"Worldwide_Gross":84689,"US_DVD_Sales":null,"Production_Budget":350000,"Release_Date":"18-Oct-06","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Truly Indie","Source":null,"Major_Genre":null,"Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":97,"IMDB_Rating":7.4,"IMDB_Votes":439}
{"Title":"55 Days at Peking","US_Gross":10000000,"Worldwide_Gross":10000000,"US_DVD_Sales":null,"Production_Budget":17000000,"Release_Date":"1963-01-01","MPAA_Rating":null,"Running_Time_min":null,"Distributor":null,"Source":"Original Screenplay","Major_Genre":"Drama","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":57,"IMDB_Rating":6.8,"IMDB_Votes":2104}
{"Title":"Nine 1/2 Weeks","US_Gross":6734844,"Worldwide_Gross":6734844,"US_DVD_Sales":null,"Production_Budget":18000000,"Release_Date":"21-Feb-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Drama","Creative_Type":"Contemporary Fiction","Director":"Adrian Lyne","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.4,"IMDB_Votes":12731}
{"Title":"AstÈrix aux Jeux Olympiques","US_Gross":999811,"Worldwide_Gross":132999811,"US_DVD_Sales":null,"Production_Budget":113500000,"Release_Date":"4-Jul-08","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Alliance","Source":"Based on Comic/Graphic Novel","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":4.9,"IMDB_Votes":5620}
{"Title":"The Abyss","US_Gross":54243125,"Worldwide_Gross":54243125,"US_DVD_Sales":null,"Production_Budget":70000000,"Release_Date":"9-Aug-89","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"20th Century Fox","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Science Fiction","Director":"James Cameron","Rotten_Tomatoes_Rating":88,"IMDB_Rating":7.6,"IMDB_Votes":51018}
{"Title":"Action Jackson","US_Gross":20257000,"Worldwide_Gross":20257000,"US_DVD_Sales":null,"Production_Budget":7000000,"Release_Date":"12-Feb-88","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Lorimar Motion Pictures","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":10,"IMDB_Rating":4.6,"IMDB_Votes":3856}
{"Title":"Ace Ventura: Pet Detective","US_Gross":72217396,"Worldwide_Gross":107217396,"US_DVD_Sales":null,"Production_Budget":12000000,"Release_Date":"4-Feb-94","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Tom Shadyac","Rotten_Tomatoes_Rating":49,"IMDB_Rating":6.6,"IMDB_Votes":63543}
{"Title":"Ace Ventura: When Nature Calls","US_Gross":108360063,"Worldwide_Gross":212400000,"US_DVD_Sales":null,"Production_Budget":30000000,"Release_Date":"10-Nov-95","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Steve Oedekerk","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.6,"IMDB_Votes":51275}
{"Title":"April Fool's Day","US_Gross":12947763,"Worldwide_Gross":12947763,"US_DVD_Sales":null,"Production_Budget":5000000,"Release_Date":"27-Mar-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Paramount Pictures","Source":"Original Screenplay","Major_Genre":"Horror","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":31,"IMDB_Rating":null,"IMDB_Votes":null}
{"Title":"Among Giants","US_Gross":64359,"Worldwide_Gross":64359,"US_DVD_Sales":null,"Production_Budget":4000000,"Release_Date":"26-Mar-99","MPAA_Rating":"R","Running_Time_min":null,"Distributor":"Fox Searchlight","Source":"Original Screenplay","Major_Genre":"Romantic Comedy","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.7,"IMDB_Votes":546}
{"Title":"Annie Get Your Gun","US_Gross":8000000,"Worldwide_Gross":8000000,"US_DVD_Sales":null,"Production_Budget":3768785,"Release_Date":"17-May-50","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":100,"IMDB_Rating":7.1,"IMDB_Votes":1326}
{"Title":"Alice in Wonderland","US_Gross":0,"Worldwide_Gross":0,"US_DVD_Sales":null,"Production_Budget":3000000,"Release_Date":"28-Jul-51","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"RKO Radio Pictures","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":20,"IMDB_Rating":6.7,"IMDB_Votes":63458}
{"Title":"The Princess and the Cobbler","US_Gross":669276,"Worldwide_Gross":669276,"US_DVD_Sales":null,"Production_Budget":24000000,"Release_Date":"25-Aug-95","MPAA_Rating":"G","Running_Time_min":null,"Distributor":"Miramax","Source":"Original Screenplay","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":7.3,"IMDB_Votes":893}
....
The "Release_Date" field contains several date formats, for example 26-Mar-99, 1963-01-01, or 4-Jul-08.
I have some code that works for part of the data:
val moviesDF = spark.read
.option("inferSchema", "true")
.json(s"${path}/movies.json")
moviesDF.show(truncate = false)
val moviesWithReleaseDates = moviesDF
.select(col("Title"), to_date(col("Release_Date"), "dd-MMM-yy").as("Actual_Release")) // conversion
moviesWithReleaseDates.show(truncate = false)
But the output is:
|Four Rooms |1995-12-25 |
|The Four Seasons |1981-05-22 |
|Four Weddings and a Funeral |1994-03-09 |
|51 Birch Street |2006-10-18 |
|55 Days at Peking |null |
|Nine 1/2 Weeks |1986-02-21 |
|AstÈrix aux Jeux Olympiques |2008-07-04 |
|The Abyss |1989-08-09 |
|Action Jackson |1988-02-12 |
|Ace Ventura: Pet Detective |1994-02-04 |
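When the date format looks like "18-Oct-06" it works fine, but when the format is different it shows null.
How can I display all the dates without nulls, in a simple and elegant way?
Thanks in advance.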
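This happens because of to_date(col("Release_Date"), "dd-MMM-yy"). The second argument is the input date format: if the date string in the JSON matches it, the value is parsed correctly; if not, the result is null.
To cover every row, you have to parse the date text from the JSON against all of the possible date formats.
Write a UDF that takes the date text as input. Inside the UDF, try each possible date format and return the parsed date object for the first one that matches. A UDF is definitely helpful here.
In any case, you need a finite list of the date formats that occur in the file for Release_Date, or that you want to support in your processing.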
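The UDF approach described above could be sketched roughly as follows. This is only a minimal sketch in plain Scala, not the answerer's own code: the object name `ReleaseDateParser`, the function name `parseReleaseDate`, and the two-entry pattern list are assumptions chosen for illustration, based on the formats visible in the sample data.

```scala
import java.sql.Date
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

object ReleaseDateParser {
  // Assumed list of formats; extend it to match whatever actually appears in the file.
  val patterns: Seq[String] = Seq("dd-MMM-yy", "yyyy-MM-dd")

  // Returns the first successful parse, or None when the input is null
  // or matches none of the patterns.
  def parseReleaseDate(text: String): Option[Date] =
    Option(text).flatMap { s =>
      patterns.iterator
        .map { p =>
          Try {
            // A new formatter per attempt, because SimpleDateFormat is not thread-safe.
            // Locale.ENGLISH keeps month names such as "Oct" parseable on any machine.
            val fmt = new SimpleDateFormat(p, Locale.ENGLISH)
            fmt.setLenient(false)
            new Date(fmt.parse(s).getTime)
          }.toOption
        }
        .collectFirst { case Some(d) => d }
    }
}
```

With Spark on the classpath, the function could presumably be wrapped as `udf((s: String) => ReleaseDateParser.parseReleaseDate(s))` and applied to `col("Release_Date")`; an empty result should then show up as null in the column. Note that SimpleDateFormat resolves a two-digit year relative to the current date (80 years back to 20 years ahead), so it is worth checking how the older titles in the file come out.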
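You can write a UDF to parse the date string along the following lines:
import java.sql.Date
import java.text.SimpleDateFormat
import scala.util.{Success, Try}
val formatStrings = Seq("dd-MMM-yy", "yyyy-MM-dd", "other-formats")
// SimpleDateFormat.parse throws on a mismatch instead of returning null, so wrap each attempt in Try.
// java.sql.Date is returned because Spark UDFs cannot return java.util.Date.
def tryParse(dateString: String): Option[Date] =
  formatStrings.iterator
    .map(fmt => Try(new Date(new SimpleDateFormat(fmt).parse(dateString).getTime)))
    .collectFirst { case Success(date) => date }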
Or use coalesce, which returns the first non-null result:
coalesce(
to_date(col("Release_Date"), "dd-MMM-yy"),
to_date(col("Release_Date"), "yyyy-MM-dd"),
to_date(col("Release_Date"), "other-date-format")
).as("Actual_Release")
Or:
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("Actual_Release", coalesce(dt_formats.map(fmt => to_date($"Release_Date", fmt)):_*))
You can try something like this. I don't know if it's elegant, but it's simple:
val mWRD = moviesDF.selectExpr("""Title""",
"""IF(LENGTH(Release_Date) <= 9,to_date(Release_Date,'dd-MMM-yy'),
to_date(Release_Date,'yyyy-MM-dd')) AS Actual_Release""")
mWRD.show(truncate = false)