PySpark: extracting the date from a datetime value

Question

I am trying to figure out how to extract a date from a datetime value using PySpark SQL.

The datetime values look like this:

DateTime
2018-05-21T00:00:00.000-04:00
2016-02-22T02:00:02.234-06:00

When I load this into a Spark dataframe and try to extract the date via

Date() or
Timestamp() and then Date()

I always get an error saying that a date or timestamp value is expected, but a DateTime value was provided.

Can someone help me retrieve the date from this value? I think a timezone needs to be provided for that, but since I already had problems extracting only the date, I wanted to solve this first.

Thank you and kind regards.

Answer 1

PySpark has a to_date function to extract the date from a timestamp. In your example you could create a new column with just the date by doing the following:

import pyspark.sql.functions as func
df = df.withColumn("date_only", func.to_date(func.col("DateTime")))

If the column you are trying to convert is a string, you can set the format parameter of to_date to specify the datetime format of the string.

You can read more about to_date in the PySpark documentation.

Answer 2

You can use any of the date_format, from_unixtime, or to_date functions to extract the date from the input string.

Example:

The input dataframe df looks as follows:

#sample dataframe
df=spark.createDataFrame([('2018-05-21T00:00:00.000-04:00',),('2016-02-22T02:00:02.234-06:00',)],['ts'])

#set the session time zone to UTC
spark.sql("set spark.sql.session.timeZone=UTC")

df.show(10,False)
#+-----------------------------+
#|ts                           |
#+-----------------------------+
#|2018-05-21T00:00:00.000-04:00|
#|2016-02-22T02:00:02.234-06:00|
#+-----------------------------+
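The session time zone set in the sample above can matter because a value carrying a UTC offset may fall on a different calendar date depending on the zone it is rendered in. The following plain-Python sketch (standard library only, not Spark) illustrates the effect with the first sample value; the -06:00 target zone is an arbitrary choice for illustration.

```python
from datetime import datetime, timedelta, timezone

# First sample value from the question; it carries a -04:00 offset.
raw = "2018-05-21T00:00:00.000-04:00"
ts = datetime.fromisoformat(raw)

# Calendar date as written in the string (wall clock at -04:00).
written_date = ts.date()

# Same instant rendered in UTC: 04:00 on the same day.
utc_date = ts.astimezone(timezone.utc).date()

# Same instant rendered at -06:00 (arbitrary example zone):
# 22:00 on the previous day.
west_date = ts.astimezone(timezone(timedelta(hours=-6))).date()

print(written_date, utc_date, west_date)
```

Here the UTC rendering keeps the date 2018-05-21, while the -06:00 rendering moves it back to 2018-05-20, which is why pinning the time zone before extracting dates gives reproducible results.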

1. Using the date_format() function:

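#import only the functions used in the examples
from pyspark.sql.functions import col, date_format, from_unixtime, to_date, unix_timestamp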
df.select(date_format(col('ts'),"yyyy-MM-dd").alias('ts').cast("date")).show(10,False)
#+----------+
#|ts        |
#+----------+
#|2018-05-21|
#|2016-02-22|
#+----------+

2. Using the to_date() function:

df.select(to_date(col('ts')).alias('ts').cast("date")).show(10,False)
#+----------+
#|ts        |
#+----------+
#|2018-05-21|
#|2016-02-22|
#+----------+

3. Using the from_unixtime(unix_timestamp()) functions:

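#note: the pattern stops before the UTC offset (e.g. -04:00), so the offset is not parsed;
#a stricter parser may need the pattern to cover it, e.g. by appending XXX (assumption)
df.select(from_unixtime(unix_timestamp(col('ts'),"yyyy-MM-dd'T'HH:mm:ss.SSS"),"yyyy-MM-dd").alias("ts").cast("date")).show(10,False)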
#+----------+
#|ts        |
#+----------+
#|2018-05-21|
#|2016-02-22|
#+----------+
