How to deal with multiple date formats? Spark - Scala

I have data in JSON format like this:

....
{"Title":"51 Birch Street","US_Gross":84689,"Worldwide_Gross":84689,"US_DVD_Sales":null,"Production_Budget":350000,"Release_Date":"18-Oct-06","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Truly Indie","Source":null,"Major_Genre":null,"Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":97,"IMDB_Rating":7.4,"IMDB_Votes":439}
{"Title":"55 Days at Peking","US_Gross":10000000,"Worldwide_Gross":10000000,"US_DVD_Sales":null,"Production_Budget":17000000,"Release_Date":"1963-01-01","MPAA_Rating":null,"Running_Time_min":null,"Distributor":null,"Source":"Original Screenplay","Major_Genre":"Drama","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":57,"IMDB_Rating":6.8,"IMDB_Votes":2104}
{"Title":"Nine 1/2 Weeks","US_Gross":6734844,"Worldwide_Gross":6734844,"US_DVD_Sales":null,"Production_Budget":18000000,"Release_Date":"21-Feb-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Drama","Creative_Type":"Contemporary Fiction","Director":"Adrian Lyne","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.4,"IMDB_Votes":12731}
{"Title":"AstÈrix aux Jeux Olympiques","US_Gross":999811,"Worldwide_Gross":132999811,"US_DVD_Sales":null,"Production_Budget":113500000,"Release_Date":"4-Jul-08","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Alliance","Source":"Based on Comic/Graphic Novel","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":4.9,"IMDB_Votes":5620}
{"Title":"The Abyss","US_Gross":54243125,"Worldwide_Gross":54243125,"US_DVD_Sales":null,"Production_Budget":70000000,"Release_Date":"9-Aug-89","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"20th Century Fox","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Science Fiction","Director":"James Cameron","Rotten_Tomatoes_Rating":88,"IMDB_Rating":7.6,"IMDB_Votes":51018}
{"Title":"Action Jackson","US_Gross":20257000,"Worldwide_Gross":20257000,"US_DVD_Sales":null,"Production_Budget":7000000,"Release_Date":"12-Feb-88","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Lorimar Motion Pictures","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":10,"IMDB_Rating":4.6,"IMDB_Votes":3856}
{"Title":"Ace Ventura: Pet Detective","US_Gross":72217396,"Worldwide_Gross":107217396,"US_DVD_Sales":null,"Production_Budget":12000000,"Release_Date":"4-Feb-94","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Tom Shadyac","Rotten_Tomatoes_Rating":49,"IMDB_Rating":6.6,"IMDB_Votes":63543}
{"Title":"Ace Ventura: When Nature Calls","US_Gross":108360063,"Worldwide_Gross":212400000,"US_DVD_Sales":null,"Production_Budget":30000000,"Release_Date":"10-Nov-95","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Steve Oedekerk","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.6,"IMDB_Votes":51275}
{"Title":"April Fool's Day","US_Gross":12947763,"Worldwide_Gross":12947763,"US_DVD_Sales":null,"Production_Budget":5000000,"Release_Date":"27-Mar-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Paramount Pictures","Source":"Original Screenplay","Major_Genre":"Horror","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":31,"IMDB_Rating":null,"IMDB_Votes":null}
{"Title":"Among Giants","US_Gross":64359,"Worldwide_Gross":64359,"US_DVD_Sales":null,"Production_Budget":4000000,"Release_Date":"26-Mar-99","MPAA_Rating":"R","Running_Time_min":null,"Distributor":"Fox Searchlight","Source":"Original Screenplay","Major_Genre":"Romantic Comedy","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.7,"IMDB_Votes":546}
{"Title":"Annie Get Your Gun","US_Gross":8000000,"Worldwide_Gross":8000000,"US_DVD_Sales":null,"Production_Budget":3768785,"Release_Date":"17-May-50","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":100,"IMDB_Rating":7.1,"IMDB_Votes":1326}
{"Title":"Alice in Wonderland","US_Gross":0,"Worldwide_Gross":0,"US_DVD_Sales":null,"Production_Budget":3000000,"Release_Date":"28-Jul-51","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"RKO Radio Pictures","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":20,"IMDB_Rating":6.7,"IMDB_Votes":63458}
{"Title":"The Princess and the Cobbler","US_Gross":669276,"Worldwide_Gross":669276,"US_DVD_Sales":null,"Production_Budget":24000000,"Release_Date":"25-Aug-95","MPAA_Rating":"G","Running_Time_min":null,"Distributor":"Miramax","Source":"Original Screenplay","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":7.3,"IMDB_Votes":893}
....

where the "Release_Date" field contains several date formats, such as 26-Mar-99, 1963-01-01, or 4-Jul-08.

I have some working code:

      import org.apache.spark.sql.functions.{col, to_date}

      val moviesDF = spark.read
        .option("inferSchema", "true")
        .json(s"${path}/movies.json")

      moviesDF.show(truncate = false)

      val moviesWithReleaseDates = moviesDF
        .select(col("Title"), to_date(col("Release_Date"), "dd-MMM-yy").as("Actual_Release")) // conversion
      moviesWithReleaseDates.show(truncate = false)

but the output contains nulls:

|Four Rooms                                |1995-12-25    |
|The Four Seasons                          |1981-05-22    |
|Four Weddings and a Funeral               |1994-03-09    |
|51 Birch Street                           |2006-10-18    |
|55 Days at Peking                         |null          |
|Nine 1/2 Weeks                            |1986-02-21    |
|AstÈrix aux Jeux Olympiques               |2008-07-04    |
|The Abyss                                 |1989-08-09    |
|Action Jackson                            |1988-02-12    |
|Ace Ventura: Pet Detective                |1994-02-04    |

When the date format is like "18-Oct-06" it works fine, but when the date format is different it shows null.

How can I show all the dates without nulls in a simple and elegant way?

Thanks in advance.

It's because of to_date(col("Release_Date"), "dd-MMM-yy"). Here you are providing a single input date format, and a value is read correctly only if the date in the JSON matches it. If it does not match, the result is null.

So you have to read the date text from the JSON using all the possible date formats.

Write a UDF and pass the date text to it as input. In the UDF, try each possible date format and, if one matches, return the proper date object. A UDF is certainly helpful here.

In any case, you need a finite list of the date formats that occur in the file for Release_Date, or that you want to support in the processing.

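You can write a UDF that parses the date string using the method below:

    import java.text.SimpleDateFormat
    import java.util.Locale
    import scala.util.Try

    // Add any other formats present in the data to this list.
    val formatStrings = Seq("dd-MMM-yy", "yyyy-MM-dd")

    // SimpleDateFormat.parse throws on a mismatch instead of returning null,
    // so wrap each attempt in Try and keep the first success.
    def tryParse(dateString: String): Option[java.util.Date] =
      formatStrings.iterator
        .map(fmt => Try(new SimpleDateFormat(fmt, Locale.ENGLISH).parse(dateString)).toOption)
        .collectFirst { case Some(date) => date }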
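
As a hedged alternative to SimpleDateFormat, the same "try each format in turn" idea can be sketched with java.time, whose formatters are immutable and thread-safe and let you choose how two-digit years are resolved. The object name ReleaseDateParser and the base year 1930 are assumptions made for illustration, not part of the original answer; pick a base year that fits your data.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.time.temporal.ChronoField
import java.util.Locale
import scala.util.Try

// Hypothetical helper, named here only for illustration.
object ReleaseDateParser {
  // Matches values such as "18-Oct-06" or "4-Jul-08".
  // Assumption: two-digit years fall in the range 1930..2029.
  private val shortYear: DateTimeFormatter = new DateTimeFormatterBuilder()
    .parseCaseInsensitive()
    .appendPattern("d-MMM-")
    .appendValueReduced(ChronoField.YEAR, 2, 2, 1930)
    .toFormatter(Locale.ENGLISH)

  // Matches values such as "1963-01-01".
  private val isoDate: DateTimeFormatter = DateTimeFormatter.ISO_LOCAL_DATE

  private val formatters = Seq(shortYear, isoDate)

  // Returns the first successful parse, or None for null or unparseable input.
  def parse(raw: String): Option[LocalDate] =
    Option(raw).map(_.trim).flatMap { s =>
      formatters.iterator
        .map(f => Try(LocalDate.parse(s, f)).toOption)
        .collectFirst { case Some(date) => date }
    }
}
```

If you go the UDF route, a function like this could be wrapped with something along the lines of udf((s: String) => ReleaseDateParser.parse(s).map(java.sql.Date.valueOf)), so that unparseable values become null in the resulting column. Keep in mind that UDFs are opaque to Spark's optimizer, so the built-in functions are usually preferable when they are sufficient.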

Or use coalesce, which returns the first non-null result (replace "other-date-format" with a real pattern):

coalesce(
to_date(col("Release_Date"), "dd-MMM-yy"),
to_date(col("Release_Date"), "yyyy-MM-dd"),
to_date(col("Release_Date"), "other-date-format")
).as("Actual_Release")

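Or build the coalesce from a list of formats:

// Requires: import org.apache.spark.sql.functions.{coalesce, col, to_date}
// coalesce returns the first non-null result, so the two-digit-year pattern from the question goes first.
val dt_formats = Seq("dd-MMM-yy", "dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")

val newDF = moviesDF.withColumn("Actual_Release", coalesce(dt_formats.map(fmt => to_date(col("Release_Date"), fmt)): _*))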

You could try something like this. I don't know if it is elegant, but it is simple:

val mWRD = moviesDF.selectExpr("""Title""",
"""IF(LENGTH(Release_Date) <= 9,to_date(Release_Date,'dd-MMM-yy'),
to_date(Release_Date,'yyyy-MM-dd')) AS Actual_Release""")
mWRD.show(truncate = false)
