Pyspark - Convert mmddyy to YYYY-MM-DD

Question

I am working on a large file in which one field is stored as a string in mmddyy format, and I need to convert it to YYYY-MM-DD. I tried creating a UDF to do the conversion, following another post, but it throws an error.

Actual field in dataframe:

+-----------+
|DATE_OPENED|
+-----------+
|     072111|
|     090606|

Expected Output:

+---------------+
|    DATE_OPENED|
+---------------+
|     2011-07-21|
|     2006-09-06|

Sample Code:

func =  udf (lambda x: datetime.strptime(x, '%m%d%Y'), DateType())

newdf = olddf.withColumn('open_dt' ,date_format(func(col('DATE_OPENED')) , 'YYYY-MM-DD'))

Error:

Error : ValueError: time data '072111' does not match format '%m%d%Y'

Answer 1

I was able to solve it without creating a UDF. I referred to a similar post on Stack Overflow (pyspark substring and aggregation), and it worked perfectly.

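from pyspark.sql.functions import unix_timestamp

# Java date patterns are case-sensitive: MM = month, mm = minutes
fmt = 'MMddyy'
# Parse the string to epoch seconds, then cast to a timestamp
parsed = unix_timestamp(df1['DATE_OPENED'], fmt).cast('timestamp')
df1 = df1.withColumn('DATE_OPENED', parsed)

# The first 10 characters of the timestamp string are yyyy-MM-dd
df2 = df1.withColumn('open_dt', df1['DATE_OPENED'].substr(1, 10))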

Answer 2

This is possible without relying on a slow UDF. Instead, parse the data with unix_timestamp by specifying the correct format, MMddyy (the pattern is case-sensitive: uppercase MM is month, lowercase mm is minutes). unix_timestamp returns epoch seconds, so cast the result to a timestamp and then to DateType, which displays as yyyy-MM-dd by default:

df.withColumn('DATE_OPENED', unix_timestamp('DATE_OPENED', 'MMddyy').cast('timestamp').cast(DateType()))

If you have Spark 2.2+, there is an even more convenient function, to_date, which accepts the format directly:

df.withColumn('DATE_OPENED', to_date('DATE_OPENED', 'MMddyy'))
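
As a side note on the error in the question: in Python's strptime, %y matches a two-digit year and %Y a four-digit year, so '072111' cannot match '%m%d%Y'. Below is a minimal plain-Python sketch of the conversion. The helper name mmddyy_to_iso and the sample value '123199' are illustrative assumptions, not part of the original post. In Spark itself, the built-in functions above are still likely to be preferable to wrapping this in a UDF.

```python
from datetime import datetime


def mmddyy_to_iso(value):
    """Convert an 'mmddyy' string such as '072111' to 'YYYY-MM-DD'.

    Hypothetical helper for illustration only.
    """
    # %y parses a two-digit year; %Y would require four digits.
    parsed = datetime.strptime(value, "%m%d%y")
    return parsed.strftime("%Y-%m-%d")


print(mmddyy_to_iso("072111"))  # 2011-07-21
print(mmddyy_to_iso("123199"))  # 1999-12-31 (two-digit years 69-99 map to 19xx)

# The format from the question fails because '%Y' expects four digits.
try:
    datetime.strptime("072111", "%m%d%Y")
except ValueError as exc:
    print("ValueError:", exc)
```

Note that strptime maps two-digit years 00-68 to 2000-2068 and 69-99 to 1969-1999, so it is worth checking that this pivot suits the data.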
