
How to create date from year, month and day in PySpark?

I have three columns for year, month and day. How can I use them to create a date in PySpark?

You can use concat_ws() to concatenate the columns with - and cast the result to date.

# sample data
df.show()

#+----+-----+---+
#|year|month|day|
#+----+-----+---+
#|2020|   12| 12|
#+----+-----+---+
from pyspark.sql.functions import *

df.withColumn("date",concat_ws("-",col("year"),col("month"),col("day")).cast("date")).show()
#+----+-----+---+----------+
#|year|month|day|      date|
#+----+-----+---+----------+
#|2020|   12| 12|2020-12-12|
#+----+-----+---+----------+

# dynamic way
cols=["year","month","day"]
df.withColumn("date",concat_ws("-",*cols).cast("date")).show()
#+----+-----+---+----------+
#|year|month|day|      date|
#+----+-----+---+----------+
#|2020|   12| 12|2020-12-12|
#+----+-----+---+----------+
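For intuition, here is the same join-with-dashes logic sketched in plain Python on a hypothetical row; Spark's concat_ws applies this per row across the whole DataFrame:

```python
# Plain-Python sketch of what concat_ws("-", year, month, day) builds per row
row = {"year": 2020, "month": 12, "day": 12}
date_str = "-".join(str(row[c]) for c in ["year", "month", "day"])
print(date_str)  # 2020-12-12
```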

# using date_format, to_timestamp, to_date, from_unixtime(unix_timestamp) functions

df.withColumn("date",date_format(concat_ws("-",*cols),"yyyy-MM-dd").cast("date")).show()
df.withColumn("date",to_timestamp(concat_ws("-",*cols),"yyyy-MM-dd").cast("date")).show()
df.withColumn("date",to_date(concat_ws("-",*cols),"yyyy-MM-dd")).show()
df.withColumn("date",from_unixtime(unix_timestamp(concat_ws("-",*cols),"yyyy-MM-dd"),"yyyy-MM-dd").cast("date")).show()
#+----+-----+---+----------+
#|year|month|day|      date|
#+----+-----+---+----------+
#|2020|   12| 12|2020-12-12|
#+----+-----+---+----------+
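All four variants parse the same "yyyy-MM-dd" string; a minimal plain-Python analogue of that parse step (Python's strptime uses %Y-%m-%d where Spark uses the Java-style yyyy-MM-dd pattern):

```python
from datetime import datetime

# Parse the concatenated string, roughly what to_date(..., "yyyy-MM-dd") does per row
d = datetime.strptime("2020-12-12", "%Y-%m-%d").date()
print(d)  # 2020-12-12
```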

For Spark 3+, you can use the make_date function:

df = df.withColumn("date", expr("make_date(year, month, day)"))
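A rough plain-Python analogue of what make_date computes per row (illustrative only, not Spark's implementation; assumes ANSI mode is off, where Spark returns null rather than failing on invalid component combinations):

```python
from datetime import date

def make_date_py(year, month, day):
    # Illustrative analogue: valid components -> a date, invalid -> None
    try:
        return date(year, month, day)
    except (ValueError, TypeError):
        return None

print(make_date_py(2020, 12, 12))  # 2020-12-12
print(make_date_py(2020, 2, 30))   # None (Feb 30 does not exist)
```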

Using pyspark on Databricks, here is a solution for when you have a pure string. Unfortunately, unix_timestamp may not work and can yield wrong results. Be very cautious when using the unix_timestamp or to_date commands in pyspark: for example, if your string has a format like "20140625", they can silently generate completely wrong dates. In my case no method worked except rebuilding the string by concatenation and casting it to date, as follows.

from pyspark.sql.functions import col, lit, substring, concat

# string format to deal with: "20050627","19900401",...

# Create a new column with a shorter name, keeping the original column as well
df = df.withColumn("dod", col("date_of_death"))

# Build the date from the string components
df = df.withColumn("dod", concat(substring(df.dod, 1, 4), lit("-"), substring(df.dod, 5, 2), lit("-"), substring(df.dod, 7, 2)).cast("date"))
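The same substring-and-concatenate approach, sketched in plain Python on a sample value like those above:

```python
from datetime import date

s = "20050627"  # sample "yyyymmdd"-style string
iso = s[0:4] + "-" + s[4:6] + "-" + s[6:8]  # slice out year, month, day
d = date.fromisoformat(iso)
print(d)  # 2005-06-27
```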

The result is a new dod column of date type. (screenshot of the resulting DataFrame omitted)

Beware of using the following format: it most probably, and oddly, generates wrong results without raising or showing any error. In my case it ruined most of my analysis:

### wrong use! use only on strings with delimiters ("yyyy-MM-dd") and be highly cautious!
f.to_date(f.unix_timestamp(df.dod, "yyyymmdd").cast("timestamp"))
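The likely culprit is the pattern itself: in Spark's Java-style datetime patterns, lowercase mm means minutes while MM means months, so "yyyymmdd" parses "06" as minutes and silently produces the wrong date. Python's strptime has the same trap with %M versus %m:

```python
from datetime import datetime

# %M is minutes, %m is months -- the wrong pattern still "succeeds" silently
wrong = datetime.strptime("20140625", "%Y%M%d")
right = datetime.strptime("20140625", "%Y%m%d")
print(wrong)  # 2014-01-25 00:06:00 -- month silently defaulted to January
print(right)  # 2014-06-25 00:00:00
```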
