PySpark Keep only Year and Month in Date

I have a dataframe with a column date_key of DateType. I want to create another column containing only the yyyy-MM part of date_key, while still keeping it a date type. I tried to_date(df['date_key'], 'YYYY-MM'), which does not work. I also tried date_format(df['date_key'], 'YYYY-MM'), but the result is a string rather than a date. Could someone please help? Many thanks. The result I need is in the format 2020-09, with no day or timestamp after it.

You can use date_trunc to reduce the precision of a timestamp:

from pyspark.sql import functions as f

df = spark.createDataFrame([['2020-09-30'], ['2020-11-11']], ['date']) \
      .select(f.to_date(f.col('date'), 'yyyy-MM-dd').alias('date_key'))
df.show()
+----------+
|  date_key|
+----------+
|2020-09-30|
|2020-11-11|
+----------+

Then truncate:

df.select(f.date_trunc('mm', f.col('date_key'))).show()
+------------------------+
|date_trunc(mm, date_key)|
+------------------------+
|     2020-09-01 00:00:00|
|     2020-11-01 00:00:00|
+------------------------+

date_trunc retains precision up to the specified unit, 'mm' in this case meaning month. Note that its result is a TimestampType, and that a date value always carries a day component, so a pure 2020-09 value cannot be stored as a date: date_format gives you that as a string, while truncation keeps the first day of the month.
