Convert date from String to Date format in Dataframes

I am trying to convert a column from String format to Date format using the to_date function, but it's returning null values.

spark.sql("select Date from incidents").show()

|      Date|

spark.sql("select to_date(Date) from incidents").show()

|to_date(CAST(Date AS DATE))|
|                       null|
|                       null|
|                       null|
|                       null|

The Date column is in String format:

 |-- Date: string (nullable = true)

Use to_date with a Java SimpleDateFormat pattern.

Example:

spark.sql("""SELECT TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate""").show()

|   newdate|

I solved the same problem without a temp table/view, using DataFrame functions.

I found that only one string format works with this solution: yyyy-MM-dd.

For example:

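import org.apache.spark.sql.functions.col  // toDF also needs spark.implicits._ (already in scope in spark-shell)
val df = sc.parallelize(Seq("2016-08-26")).toDF("Id")
val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
val df3 = df2.withColumn("Date", (col("Id").cast("date")))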


 |-- Id: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Date: date (nullable = true)


|        Id|           Timestamp|      Date|
|2016-08-26|2016-08-26 00:00:...|2016-08-26|

The timestamp of course has 00:00:00.0 as its time value.
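The "only yyyy-MM-dd works" behaviour of a plain cast can be pictured outside Spark. The snippet below is only an analogy in plain Python, not Spark code: cast_to_date is a made-up helper, and date.fromisoformat stands in for the cast, under the assumption that both accept ISO-style text and reject other layouts.

```python
from datetime import date

def cast_to_date(text):
    """Mimic a plain cast: accept ISO yyyy-MM-dd text, return None for anything else."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None

iso_result = cast_to_date("2016-08-26")
us_result = cast_to_date("08/26/2016")

print(iso_result)  # 2016-08-26
print(us_result)   # None
```

Any other layout needs an explicit pattern, which is what the answers using to_date or unix_timestamp with a format string do.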

Since your main aim was to convert the type of a column in a DataFrame from String to Timestamp, I think this approach would be better.

import org.apache.spark.sql.functions.{to_date, to_timestamp}
val modifiedDF = DF.withColumn("Date", to_date($"Date", "MM/dd/yyyy"))

You could also use to_timestamp (available from Spark 2.2, like the two-argument to_date above) if you require a fine-grained timestamp.

You can also do this with a query:

select from_unixtime(unix_timestamp('08/26/2016', 'MM/dd/yyyy'), 'yyyy:MM:dd') as new_format
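One thing worth knowing about this query: from_unixtime returns a string, so new_format is text in the requested layout rather than a date column. The two steps (parse with one pattern, render with another) look like this in plain Python, used here only as a stand-in for the SQL functions. reformat_date is a made-up helper name, and the %-style codes are Python's, not Spark's.

```python
from datetime import datetime

def reformat_date(text, in_fmt, out_fmt):
    """Parse text with in_fmt, then render it as a string with out_fmt."""
    return datetime.strptime(text, in_fmt).strftime(out_fmt)

new_format = reformat_date("08/26/2016", "%m/%d/%Y", "%Y:%m:%d")
print(new_format)  # 2016:08:26
```

If a real date column is needed, wrap the parsed value in to_date (or cast it) instead of formatting it back to text.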


You can also pass the date format:

df.withColumn("Date",to_date(unix_timestamp(df.col("your_date_column"), "your_date_format").cast("timestamp")))

For example:

import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq("06 Jul 2018")).toDF("dateCol")
df.withColumn("Date",to_date(unix_timestamp(df.col("dateCol"), "dd MMM yyyy").cast("timestamp")))


spark.sql("SELECT from_unixtime(unix_timestamp(cast(dateid as varchar(10)), 'yyyyMMdd'), 'yyyy-MM-dd') from XYZ").show(50, false)

The code below might be helpful for you.

val stringDate = spark.sparkContext.parallelize(Seq("12/16/2019")).toDF("StringDate")
// the sample is month/day/year; MM is month, lowercase mm would mean minutes
val dateConversion = stringDate.withColumn("dateColumn", to_date(unix_timestamp($"StringDate", "MM/dd/yyyy").cast("Timestamp")))
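Whatever pattern is passed has to match the field order of the data. The sample 12/16/2019 can only be month-first (there is no month 16), and in Java-style patterns uppercase MM means month while lowercase mm means minutes. A quick way to check which layout a sample fits, sketched in plain Python with its own %-codes; fits and the candidate list are made up for illustration.

```python
from datetime import datetime

def fits(text, fmt):
    """True when text can be parsed with the given strptime format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

sample = "12/16/2019"
candidates = {
    "month/day/year": "%m/%d/%Y",
    "day/month/year": "%d/%m/%Y",
}
matching = [name for name, fmt in candidates.items() if fits(sample, fmt)]
print(matching)  # ['month/day/year']
```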

The solution proposed above by Sai Kiriti Badam worked for me.

I'm using Azure Databricks to read data captured from an EventHub. This contains a string column named EnqueuedTimeUtc with the following format...

12/7/2018 12:54:13 PM

I'm using a Python notebook and used the following...

import pyspark.sql.functions as func

sports_messages = sports_df.withColumn("EnqueuedTimestamp", func.to_timestamp("EnqueuedTimeUtc", "MM/dd/yyyy hh:mm:ss aaa"))

... to create a new column EnqueuedTimestamp of type "timestamp" with data in the following format...

2018-12-07 12:54:13
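To sanity-check what a 12-hour pattern with an AM/PM marker should produce, without spinning up a cluster, the same string can be parsed with Python's standard library. This is only a cross-check, not a replacement for to_timestamp; the %-codes are Python's equivalents of the Spark pattern, and the AM variant is an extra input I made up to show why the marker matters.

```python
from datetime import datetime

raw = "12/7/2018 12:54:13 PM"
parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")  # %I = 12-hour clock, %p = AM/PM
print(parsed)  # 2018-12-07 12:54:13

# The same clock reading marked AM is just after midnight.
parsed_am = datetime.strptime("12/7/2018 12:54:13 AM", "%m/%d/%Y %I:%M:%S %p")
print(parsed_am)  # 2018-12-07 00:54:13
```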

I have personally found some errors when using unix_timestamp-based date conversions from dd-MMM-yyyy format to yyyy-MM-dd with Spark 1.6, and this may extend to more recent versions. Below I explain a way to solve the problem using java.time that should work in all versions of Spark:

I've seen errors when doing:

from_unixtime(unix_timestamp(StockMarketClosingDate, 'dd-MMM-yyyy'), 'yyyy-MM-dd') as FormattedDate

Below is code to illustrate the error, and my solution to fix it. First I read in stock market data, in a common standard file format:

    import sys.process._
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
    import sqlContext.implicits._

    val EODSchema = StructType(Array(
        StructField("Symbol"                , StringType, true),     //$1       
        StructField("Date"                  , StringType, true),     //$2       
        StructField("Open"                  , StringType, true),     //$3       
        StructField("High"                  , StringType, true),     //$4
        StructField("Low"                   , StringType, true),     //$5
        StructField("Close"                 , StringType, true),     //$6
        StructField("Volume"                , StringType, true)      //$7
    ))

    val textFileName = "/user/feeds/eoddata/INDEX/INDEX_19*.csv"

    // below is code to read using later versions of spark
    //val eoddata = spark.read.format("csv").option("sep", ",").schema(EODSchema).option("header", "true").load(textFileName)

    // here is code to read using 1.6, via, "com.databricks:spark-csv_2.10:1.2.0"

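    val eoddata = sqlContext.read
                               .format("com.databricks.spark.csv")                     // spark-csv data source (Spark 1.6)
                               .option("header", "true")                               // Use first line of all files as header
                               .option("delimiter", ",")                               //.option("dateFormat", "dd-MMM-yyyy") failed to work
                               .schema(EODSchema)                                      // same schema and path as the commented-out line above
                               .load(textFileName)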


And here are the date conversions that have issues:

-- notice there are errors around the turn of the year
select
    e.Date as StringDate
,   cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)  as ProperDate
,   e.Close
from eoddata e
where e.Symbol = 'SPX.IDX'
order by cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)
limit 1000

A chart made in Zeppelin shows spikes, which are errors.

And here is the check that shows the date conversion errors:

// shows the unix_timestamp conversion approach can create errors
val result =  sqlContext.sql("""
Select errors.* from
(
    Select
      t.*
    , substring(t.OriginalStringDate, 8, 11) as String_Year_yyyy
    , substring(t.ConvertedCloseDate, 0, 4)  as Converted_Date_Year_yyyy
    from
    (        Select
                Symbol
            ,   cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)  as ConvertedCloseDate
            ,   e.Date as OriginalStringDate
            ,   Close
            from eoddata e
            where e.Symbol = 'SPX.IDX'
    ) t
) errors
where String_Year_yyyy <> Converted_Date_Year_yyyy
""")

//df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp")))

result: org.apache.spark.sql.DataFrame = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
res53: result.type = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
| Symbol|ConvertedCloseDate|OriginalStringDate|  Close|String_Year_yyyy|Converted_Date_Year_yyyy|
|SPX.IDX|        1997-12-30|       30-Dec-1996| 753.85|            1996|                    1997|
|SPX.IDX|        1997-12-31|       31-Dec-1996| 740.74|            1996|                    1997|
|SPX.IDX|        1998-12-29|       29-Dec-1997| 953.36|            1997|                    1998|
|SPX.IDX|        1998-12-30|       30-Dec-1997| 970.84|            1997|                    1998|
|SPX.IDX|        1998-12-31|       31-Dec-1997| 970.43|            1997|                    1998|
|SPX.IDX|        1998-01-01|       01-Jan-1999|1229.23|            1999|                    1998|

After this result, I switched to java.time conversions with a UDF like this, which worked for me:

// now we will create a UDF that uses the very nice java.time library to properly convert the silly stockmarket dates
// start by importing the specific java.time libraries that superseded the joda.time ones
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// now define a specific data conversion function we want

def fromEODDate (YourStringDate: String): String = {

    val formatter = DateTimeFormatter.ofPattern("dd-MMM-yyyy")
    val retDate   = LocalDate.parse(YourStringDate, formatter)

    // this should return a proper yyyy-MM-dd date from the silly dd-MMM-yyyy formats
    // now we format this true local date with a formatter to the desired yyyy-MM-dd format

    val retStringDate = retDate.format(DateTimeFormatter.ISO_LOCAL_DATE)
    retStringDate
}

Now I register it as a function for use in SQL:

sqlContext.udf.register("fromEODDate", fromEODDate(_:String))
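For anyone on PySpark, a rough Python counterpart of the same conversion could look like the sketch below. It uses only the standard library; from_eod_date is a hypothetical name, and %b assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

def from_eod_date(your_string_date):
    """Convert a dd-MMM-yyyy string such as 30-Dec-1996 to ISO yyyy-MM-dd text."""
    parsed = datetime.strptime(your_string_date, "%d-%b-%Y").date()
    return parsed.isoformat()

print(from_eod_date("30-Dec-1996"))  # 1996-12-30
print(from_eod_date("01-Jan-1999"))  # 1999-01-01
```

A function like this could be wrapped with pyspark.sql.functions.udf, though Python UDFs are generally slower than built-in column functions, so it is mainly useful when the built-ins misbehave.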

and check the results, and rerun the test:

val results = sqlContext.sql("""
    select
        e.Symbol    as Symbol
    ,   e.Date      as OrigStringDate
    ,   Cast(fromEODDate(e.Date) as Date) as ConvertedDate
    ,   e.Open
    ,   e.High
    ,   e.Low
    ,   e.Close
    from eoddata e
    order by Cast(fromEODDate(e.Date) as Date)
    """)

results: org.apache.spark.sql.DataFrame = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
 |-- Symbol: string (nullable = true)
 |-- OrigStringDate: string (nullable = true)
 |-- ConvertedDate: date (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
res79: results.type = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
|  Symbol|OrigStringDate|ConvertedDate|   Open|   High|    Low|  Close|
|ADVA.IDX|   01-Jan-1996|   1996-01-01|    364|    364|    364|    364|
|ADVN.IDX|   01-Jan-1996|   1996-01-01|   1527|   1527|   1527|   1527|
|ADVQ.IDX|   01-Jan-1996|   1996-01-01|   1283|   1283|   1283|   1283|
|BANK.IDX|   01-Jan-1996|   1996-01-01|1009.41|1009.41|1009.41|1009.41|
| BKX.IDX|   01-Jan-1996|   1996-01-01|  39.39|  39.39|  39.39|  39.39|
|COMP.IDX|   01-Jan-1996|   1996-01-01|1052.13|1052.13|1052.13|1052.13|
| CPR.IDX|   01-Jan-1996|   1996-01-01|  1.261|  1.261|  1.261|  1.261|
|DECA.IDX|   01-Jan-1996|   1996-01-01|    205|    205|    205|    205|
|DECN.IDX|   01-Jan-1996|   1996-01-01|    825|    825|    825|    825|
|DECQ.IDX|   01-Jan-1996|   1996-01-01|    754|    754|    754|    754|
only showing top 10 rows

which looks OK, and I rerun my chart to see if there are errors/spikes:

As you can see, no more spikes or errors. I now use a UDF, as shown, to transform my dates to a standard yyyy-MM-dd format, and have not had spurious errors since. :-)

Use the function below in PySpark to convert a column's datatype to the one you need. Here I'm converting every date column to timestamp.

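from pyspark.sql.functions import col

def change_dtype(df):
    # cast every column whose type is date to timestamp, leaving the others untouched
    for name, dtype in df.dtypes:
        if dtype == "date":
            df = df.withColumn(name, col(name).cast('timestamp'))
    return df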

You can simply do df.withColumn("date", date_format(col("string"), "yyyy-MM-dd HH:mm:ss.SSSSSS")).show() (note that date_format returns a formatted string column, not a date).

This works in Spark SQL:
TO_DATE(date_string_or_column, 'yyyy-MM-dd') AS date_column_name
You can replace the second argument with however your date string is formatted, e.g. 'yyyy/MM/dd'. The return type is date.

If your string data is in the format 'dd/MM/yyyy' (with slashes) and you are using a Spark version greater than 3.0, converting the string column to a date turns the values into null.

For that to work you can set the Spark configuration property that allows you to get the output you want.


and then we can use the code below to get the output that we want:

df.withColumn("tx_date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))

