Convert timestamp to date in Spark dataframe

I've seen (here: How to convert Timestamp to Date format in DataFrame?) the way to convert a timestamp into a date type, but, at least for me, it doesn't work.

Here is what I've tried:

from pyspark.sql import functions as func
from pyspark.sql import types as stypes

# Create dataframe
df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date',])

# Convert to timestamp
df_test2 = df_test.withColumn('timestamp', func.when((df_test.date.isNull() | (df_test.date == '')), '0')
    .otherwise(func.unix_timestamp(df_test.date, 'yyyyMMdd')))

# Convert timestamp to date again
df_test2.withColumn('date_again', df_test2['timestamp'].cast(stypes.DateType())).show()

But this returns null in the column date_again:

+--------+----------+----------+
|    date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502229600|      null|
|20171007|1507327200|      null|
+--------+----------+----------+

Any idea of what's failing?

Following:

func.when((df_test.date.isNull() | (df_test.date == '')) , '0')\
  .otherwise(func.unix_timestamp(df_test.date,'yyyyMMdd'))

doesn't work because it is type inconsistent - the first clause returns string while the second clause returns bigint. As a result, it will always return NULL if data is NOT NULL and not empty.

It is also obsolete - the SQL functions are NULL- and malformed-format safe, so there is no need for additional checks.

In [1]: spark.sql("SELECT unix_timestamp(NULL, 'yyyyMMdd')").show()
+----------------------------------------------+
|unix_timestamp(CAST(NULL AS STRING), yyyyMMdd)|
+----------------------------------------------+
|                                          null|
+----------------------------------------------+


In [2]: spark.sql("SELECT unix_timestamp('', 'yyyyMMdd')").show()
+--------------------------+
|unix_timestamp(, yyyyMMdd)|
+--------------------------+
|                      null|
+--------------------------+
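Putting the two points together, the guard can simply be dropped. A minimal sketch, reusing df_test and the func alias from the question; unlike the original expression, this keeps the timestamp column as bigint instead of coercing it to string:

from pyspark.sql import functions as func

# unix_timestamp is NULL-safe and returns NULL for malformed input,
# so no when/otherwise guard is needed; the column stays bigint
df_test2 = df_test.withColumn('timestamp', func.unix_timestamp(df_test.date, 'yyyyMMdd'))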

And you don't need the intermediate step at all in Spark 2.2 or later:

from pyspark.sql.functions import to_date

to_date("date", "yyyyMMdd")
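For example, applied to the df_test frame from the question (a sketch; the frame and column names are taken from the question above):

from pyspark.sql.functions import to_date

# Parses the yyyyMMdd string directly into a date column, with no unix_timestamp step
df_test.withColumn('date_again', to_date('date', 'yyyyMMdd')).show()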

You should be doing the following - from_unixtime renders the unix seconds as a 'yyyy-MM-dd HH:mm:ss' string, which a DateType cast can then parse:

>>> from pyspark.sql.types import DateType
>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).show()
+--------+----------+----------+
|    date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502216100|2017-08-09|
|20171007|1507313700|2017-10-07|
+--------+----------+----------+

and the schema is:

>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).printSchema()
root
 |-- date: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- date_again: date (nullable = true)

For pyspark:

Assume you have a field named 'DateTime' that shows the date as a date and a time.

Add a new field to your df that shows a 'DateOnly' column as follows:

from pyspark.sql.functions import date_format

df.withColumn("DateOnly", date_format('DateTime', "yyyyMMdd")).show()

This will show a new column in the df called DateOnly, with the date in yyyyMMdd form.

To convert a unix_timestamp column (called TIMESTMP) in a pyspark dataframe (df) to a Date type:

Below is a two-step process (there may be a shorter way):

  • convert from UNIX timestamp to timestamp
  • convert from timestamp to Date

Initially df.printSchema() shows: -- TIMESTMP: long (nullable = true)

Use spark.sql to implement the conversion as follows:

df.createOrReplaceTempView("dfTbl")  # registerTempTable is deprecated in Spark 2.x

dfNew = spark.sql("""
                     SELECT *, cast(TIMESTMP as Timestamp) as newTIMESTMP 
                     FROM dfTbl d
                  """)

dfNew.printSchema()

the printSchema() will show:

-- newTIMESTMP: timestamp (nullable = true)

Finally, convert the type from timestamp to Date as follows:

from pyspark.sql.types import DateType
dfNew = dfNew.withColumn('actual_date', dfNew['newTIMESTMP'].cast(DateType()))
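As a sketch of the shorter way alluded to above (assuming the same df and TIMESTMP column), both casts can be chained in a single DataFrame expression without the temp view:

from pyspark.sql.types import DateType, TimestampType

# cast(long -> timestamp) interprets the value as unix seconds,
# then cast(timestamp -> date) truncates the time part
dfNew = df.withColumn('actual_date', df['TIMESTMP'].cast(TimestampType()).cast(DateType()))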
import datetime
from pyspark.sql.functions import udf

# udf to convert the ts (a millisecond epoch) to a timestamp string
get_timestamp = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).strftime("%Y-%m-%d %H:%M:%S"))

# apply this udf in the dataframe with your timestamp
df_withdate = df.withColumn("datetime", get_timestamp(df.ts))

They closed my question as a duplicate of this one, so I'll copy and paste my answer here (it is a duplicate, right?).

As the timestamp column is in milliseconds, it is only necessary to convert it to seconds and cast it into TimestampType, and that should do the trick:

from pyspark.sql.types import TimestampType
import pyspark.sql.functions as F

df.select( 
      (F.col("my_timestamp") / 1000).cast(TimestampType())
)

An option without importing TimestampType:

import pyspark.sql.functions as F

F.from_unixtime(F.col('date_col') / 1000).cast('date')
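For example, applied as a new column (a sketch; date_col is assumed to hold a millisecond epoch as above, and the date_only name is illustrative):

import pyspark.sql.functions as F

# from_unixtime expects seconds, hence the division by 1000
df.withColumn('date_only', F.from_unixtime(F.col('date_col') / 1000).cast('date'))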
