Convert date from String to Date format in Dataframes

I am trying to convert a column which is in String format to Date format using the to_date function, but it's returning null values.

df.createOrReplaceTempView("incidents")
spark.sql("select Date from incidents").show()

+----------+
|      Date|
+----------+
|08/26/2016|
|08/26/2016|
|08/26/2016|
|06/14/2016|

spark.sql("select to_date(Date) from incidents").show()

+---------------------------+
|to_date(CAST(Date AS DATE))|
+---------------------------+
|                       null|
|                       null|
|                       null|
|                       null|

The Date column is in String format:

 |-- Date: string (nullable = true)

Use to_date with a Java SimpleDateFormat pattern:

TO_DATE(CAST(UNIX_TIMESTAMP(date, 'MM/dd/yyyy') AS TIMESTAMP))

Example:

spark.sql("""
  SELECT TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate"""
).show()

+----------+
|   newdate|
+----------+
|2016-08-26|
+----------+
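
Applied to the question's DataFrame, the same expression works on the Date column itself. A minimal sketch, assuming the incidents temp view registered above:

spark.sql("""
  SELECT TO_DATE(CAST(UNIX_TIMESTAMP(Date, 'MM/dd/yyyy') AS TIMESTAMP)) AS Date
  FROM incidents
""").show()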

I solved the same problem without a temp table/view, using only DataFrame functions.

Of course, I found that only one format works with this solution, and that's yyyy-MM-dd.

For example:

val df = sc.parallelize(Seq("2016-08-26")).toDF("Id")
val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
val df3 = df2.withColumn("Date", (col("Id").cast("date")))

df3.printSchema

root
 |-- Id: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Date: date (nullable = true)

df3.show

+----------+--------------------+----------+
|        Id|           Timestamp|      Date|
+----------+--------------------+----------+
|2016-08-26|2016-08-26 00:00:...|2016-08-26|
+----------+--------------------+----------+

The timestamp of course has 00:00:00.0 as a time value.

Since your main aim was to convert the type of a column in a DataFrame from String to Timestamp, I think this approach would be better.

import org.apache.spark.sql.functions.{to_date, to_timestamp}
val modifiedDF = DF.withColumn("Date", to_date($"Date", "MM/dd/yyyy"))

You could also use to_timestamp (I think it is available from Spark 2.x) if you require a fine-grained timestamp.
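
For example, a minimal sketch reusing the DF from above (the withTimestamp name and the Timestamp column are only illustrative; the parsed values get a 00:00:00 time component because the source strings carry no time of day):

val withTimestamp = DF.withColumn("Timestamp", to_timestamp($"Date", "MM/dd/yyyy"))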

You can also do this query:

sqlContext.sql("""
select from_unixtime(unix_timestamp('08/26/2016', 'MM/dd/yyyy'), 'yyyy:MM:dd') as new_format
""").show()


You can also pass a date format:

df.withColumn("Date",to_date(unix_timestamp(df.col("your_date_column"), "your_date_format").cast("timestamp")))

For example:

import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq("06 Jul 2018")).toDF("dateCol")
df.withColumn("Date",to_date(unix_timestamp(df.col("dateCol"), "dd MMM yyyy").cast("timestamp")))

dateID is an int column that contains the date in Int format:

spark.sql("SELECT from_unixtime(unix_timestamp(cast(dateid as varchar(10)), 'yyyyMMdd'), 'yyyy-MM-dd') from XYZ").show(50, false)
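
The same conversion can also be done with DataFrame functions instead of SQL. A minimal sketch, assuming a DataFrame df with an integer dateid column holding values such as 20160826 (the two-argument to_date requires Spark 2.2 or later):

import org.apache.spark.sql.functions.{col, to_date}

// cast the integer to a string, then parse it with an explicit yyyyMMdd pattern
val withDate = df.withColumn("date", to_date(col("dateid").cast("string"), "yyyyMMdd"))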

The code below might be helpful for you.

val stringDate = spark.sparkContext.parallelize(Seq("12/16/2019")).toDF("StringDate")
val dateConversion = stringDate.withColumn("dateColumn", to_date(unix_timestamp($"StringDate", "MM/dd/yyyy").cast("timestamp")))
dateConversion.show(false)

+----------+----------+
|StringDate|dateColumn|
+----------+----------+
|12/16/2019|2019-12-16|
+----------+----------+

The solution proposed above by Sai Kiriti Badam worked for me.

I'm using Azure Databricks to read data captured from an EventHub. This contains a string column named EnqueuedTimeUtc with the following format...

12/7/2018 12:54:13 PM

I'm using a Python notebook and used the following...

import pyspark.sql.functions as func

sports_messages = sports_df.withColumn("EnqueuedTimestamp", func.to_timestamp("EnqueuedTimeUtc", "MM/dd/yyyy hh:mm:ss aaa"))

... to create a new column EnqueuedTimestamp of type "timestamp" with data in the following format...

2018-12-07 12:54:13

I have personally found some errors when using unix_timestamp-based date conversions from the dd-MMM-yyyy format to yyyy-MM-dd with Spark 1.6, but this may extend to more recent versions. Below I explain a way to solve the problem using java.time that should work in all versions of Spark.

I've seen errors when doing:

from_unixtime(unix_timestamp(StockMarketClosingDate, 'dd-MMM-yyyy'), 'yyyy-MM-dd') as FormattedDate

Below is code that illustrates the error, along with my fix. First I read in stock market data in a common standard file format:

    import sys.process._
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
    import sqlContext.implicits._

    val EODSchema = StructType(Array(
        StructField("Symbol"                , StringType, true),     //$1       
        StructField("Date"                  , StringType, true),     //$2       
        StructField("Open"                  , StringType, true),     //$3       
        StructField("High"                  , StringType, true),     //$4
        StructField("Low"                   , StringType, true),     //$5
        StructField("Close"                 , StringType, true),     //$6
        StructField("Volume"                , StringType, true)      //$7
        ))

    val textFileName = "/user/feeds/eoddata/INDEX/INDEX_19*.csv"

    // below is code to read using later versions of spark
    //val eoddata = spark.read.format("csv").option("sep", ",").schema(EODSchema).option("header", "true").load(textFileName)


    // here is code to read using 1.6, via, "com.databricks:spark-csv_2.10:1.2.0"

    val eoddata = sqlContext.read
                               .format("com.databricks.spark.csv")
                               .option("header", "true")                               // Use first line of all files as header
                               .option("delimiter", ",")                               //.option("dateFormat", "dd-MMM-yyyy") failed to work
                               .schema(EODSchema)
                               .load(textFileName)

    eoddata.registerTempTable("eoddata")

And here are the date conversions that have issues:

%sql 
-- notice there are errors around the turn of the year
Select 
    e.Date as StringDate
,   cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)  as ProperDate
,   e.Close
from eoddata e
where e.Symbol = 'SPX.IDX'
order by cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)
limit 1000

A chart made in Zeppelin shows spikes, which are the errors.

[chart: errors in the date conversion appear as spikes]

And here is the check that shows the date conversion errors:

// shows the unix_timestamp conversion approach can create errors
val result =  sqlContext.sql("""
Select errors.* from
(
    Select 
    t.*
    , substring(t.OriginalStringDate, 8, 11) as String_Year_yyyy 
    , substring(t.ConvertedCloseDate, 0, 4)  as Converted_Date_Year_yyyy
    from
    (        Select
                Symbol
            ,   cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)  as ConvertedCloseDate
            ,   e.Date as OriginalStringDate
            ,   Close
            from eoddata e
            where e.Symbol = 'SPX.IDX'
    ) t 
) errors
where String_Year_yyyy <> Converted_Date_Year_yyyy
""")


//df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp")))


result.registerTempTable("SPX")
result.cache()
result.show(100)
result: org.apache.spark.sql.DataFrame = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
res53: result.type = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
+-------+------------------+------------------+-------+----------------+------------------------+
| Symbol|ConvertedCloseDate|OriginalStringDate|  Close|String_Year_yyyy|Converted_Date_Year_yyyy|
+-------+------------------+------------------+-------+----------------+------------------------+
|SPX.IDX|        1997-12-30|       30-Dec-1996| 753.85|            1996|                    1997|
|SPX.IDX|        1997-12-31|       31-Dec-1996| 740.74|            1996|                    1997|
|SPX.IDX|        1998-12-29|       29-Dec-1997| 953.36|            1997|                    1998|
|SPX.IDX|        1998-12-30|       30-Dec-1997| 970.84|            1997|                    1998|
|SPX.IDX|        1998-12-31|       31-Dec-1997| 970.43|            1997|                    1998|
|SPX.IDX|        1998-01-01|       01-Jan-1999|1229.23|            1999|                    1998|
+-------+------------------+------------------+-------+----------------+------------------------+

After this result, I switched to java.time conversions with a UDF like this, which worked for me:

// now we will create a UDF that uses the very nice java.time library to properly convert the silly stockmarket dates
// start by importing the specific java.time libraries that superseded the joda.time ones
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// now define a specific data conversion function we want

def fromEODDate (YourStringDate: String): String = {

    val formatter = DateTimeFormatter.ofPattern("dd-MMM-yyyy")
    var   retDate = LocalDate.parse(YourStringDate, formatter)

    // this should return a proper yyyy-MM-dd date from the silly dd-MMM-yyyy formats
    // now we format this true local date with a formatter to the desired yyyy-MM-dd format

    val retStringDate = retDate.format(DateTimeFormatter.ISO_LOCAL_DATE)
    return(retStringDate)
}

Now I register it as a function for use in SQL:

sqlContext.udf.register("fromEODDate", fromEODDate(_:String))

Then I check the results and rerun the test:

val results = sqlContext.sql("""
    Select
        e.Symbol    as Symbol
    ,   e.Date      as OrigStringDate
    ,   Cast(fromEODDate(e.Date) as Date) as ConvertedDate
    ,   e.Open
    ,   e.High
    ,   e.Low
    ,   e.Close
    from eoddata e
    order by Cast(fromEODDate(e.Date) as Date)
""")

results.printSchema()
results.cache()
results.registerTempTable("results")
results.show(10)
results: org.apache.spark.sql.DataFrame = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
root
 |-- Symbol: string (nullable = true)
 |-- OrigStringDate: string (nullable = true)
 |-- ConvertedDate: date (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
res79: results.type = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
+--------+--------------+-------------+-------+-------+-------+-------+
|  Symbol|OrigStringDate|ConvertedDate|   Open|   High|    Low|  Close|
+--------+--------------+-------------+-------+-------+-------+-------+
|ADVA.IDX|   01-Jan-1996|   1996-01-01|    364|    364|    364|    364|
|ADVN.IDX|   01-Jan-1996|   1996-01-01|   1527|   1527|   1527|   1527|
|ADVQ.IDX|   01-Jan-1996|   1996-01-01|   1283|   1283|   1283|   1283|
|BANK.IDX|   01-Jan-1996|   1996-01-01|1009.41|1009.41|1009.41|1009.41|
| BKX.IDX|   01-Jan-1996|   1996-01-01|  39.39|  39.39|  39.39|  39.39|
|COMP.IDX|   01-Jan-1996|   1996-01-01|1052.13|1052.13|1052.13|1052.13|
| CPR.IDX|   01-Jan-1996|   1996-01-01|  1.261|  1.261|  1.261|  1.261|
|DECA.IDX|   01-Jan-1996|   1996-01-01|    205|    205|    205|    205|
|DECN.IDX|   01-Jan-1996|   1996-01-01|    825|    825|    825|    825|
|DECQ.IDX|   01-Jan-1996|   1996-01-01|    754|    754|    754|    754|
+--------+--------------+-------------+-------+-------+-------+-------+
only showing top 10 rows

This looks OK, and I rerun my chart to see if there are errors/spikes:

[chart: the rerun after the fix]

As you can see, there are no more spikes or errors. I now use a UDF as shown to apply my date format transformations to a standard yyyy-MM-dd format, and have not had spurious errors since. :-)

Use the function below in PySpark to convert a column's datatype into the datatype you require. Here I'm converting all date columns into timestamp columns.

from pyspark.sql.functions import col

def change_dtype(df):
    # cast every date column to timestamp
    for name, dtype in df.dtypes:
        if dtype == "date":
            df = df.withColumn(name, col(name).cast('timestamp'))
    return df

You can simply do df.withColumn("date", date_format(col("string"), "yyyy-MM-dd HH:mm:ss.ssssss")).show()

This works in Spark SQL:
TO_DATE(date_string_or_column, 'yyyy-MM-dd') AS date_column_name. You can replace the second argument with however your date string is formatted, e.g. yyyy/MM/dd. The return type is date.
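
For example, a minimal sketch (the literal and the alias are only illustrative):

spark.sql("SELECT TO_DATE('2016/08/26', 'yyyy/MM/dd') AS date_column_name").show()  // shows 2016-08-26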

When your string data is in the 'dd/MM/yyyy' format (with slashes) and you use a Spark version greater than 3.0, trying to change the string type to a date converts the value to null.

In order for that to work, you can set the Spark configuration property below, which will allow you to get the output that you want:

spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

Then we can use the code below to get the output we want:

df.withColumn("tx_date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
