PySpark Keep only Year and Month in Date

I have a dataframe with a column date_key of DateType. I want to create another column containing only the yyyy-MM part of date_key, while still keeping it a date type. I tried to_date(df['date_key'], 'YYYY-MM'), which does not work. I also tried date_format(df['date_key'], 'YYYY-MM'), but the result is a string rather than a date. Could someone please help? Many thanks. The result I need is in the format 2020-09, with no day or timestamp after it.

You can use date_trunc to reduce the precision of a timestamp:

from pyspark.sql import functions as f

df = spark.createDataFrame([['2020-09-30'], ['2020-11-11']], ['date']) \
      .select(f.to_date(f.col('date'), 'yyyy-MM-dd').alias('date_key'))
df.show()
+----------+
|  date_key|
+----------+
|2020-09-30|
|2020-11-11|
+----------+

Then truncate:

df.select(f.date_trunc('mm', f.col('date_key'))).show()
+------------------------+
|date_trunc(mm, date_key)|
+------------------------+
|     2020-09-01 00:00:00|
|     2020-11-01 00:00:00|
+------------------------+

date_trunc retains precision up to the specified unit, 'mm' in this case meaning month. Note that its result is a TimestampType, and that a date value always carries a day component, so a pure 2020-09 value cannot be stored as a date: date_format gives you that as a string, while truncation keeps the first day of the month.
