
Convert PySpark String to Date with Month-Year Format

I have a PySpark dataframe with a date column encoded as a string with the following format:

df.select("issue_d").show()

+--------+
| issue_d|
+--------+
|Dec-2015|
|Dec-2015|
|Dec-2015|
|Dec-2015|
|Dec-2015|
|Dec-2015|
|Dec-2015|
|Dec-2015|
|Dec-2015|
+--------+

I would like to cast this to a date column. I know I could extract the first three letters and map each month abbreviation to an integer, but that seems clumsy. There must be a cleaner way to do this transformation in one or two lines of code. This is the output I would like to get:

df.select("issue_month").show()

+------------+
| issue_month|
+------------+
|          12|
|          12|
|          12|
|          12|
|          12|
|          12|
|          12|
|          12|
|          12|
+------------+

Use the from_unixtime and unix_timestamp functions to convert the month from 'MMM' format to 'MM'.

Example:

#sample data
df1.show()
#+--------+
#| issue_d|
#+--------+
#|Dec-2015|
#|Jun-2015|
#+--------+

df1.selectExpr("from_unixtime(unix_timestamp(issue_d,'MMM-yyyy'),'MM') as issue_month").show()
#+-----------+
#|issue_month|
#+-----------+
#|         12|
#|         06|
#+-----------+

#or add as a new column
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

df1.withColumn("issue_month",from_unixtime(unix_timestamp(col("issue_d"),'MMM-yyyy'),'MM')).show()
#+--------+-----------+
#| issue_d|issue_month|
#+--------+-----------+
#|Dec-2015|         12|
#|Jun-2015|         06|
#+--------+-----------+

#overwrite the existing column
df1.withColumn("issue_d",from_unixtime(unix_timestamp(col("issue_d"),'MMM-yyyy'),'MM')).show()
#+-------+
#|issue_d|
#+-------+
#|     12|
#|     06|
#+-------+

#overwrite the existing df1, keeping only the new column
df1=df1.withColumn("issue_month",from_unixtime(unix_timestamp(col("issue_d"),'MMM-yyyy'),'MM')).select("issue_month")
df1.show()
#+-----------+
#|issue_month|
#+-----------+
#|         12|
#|         06|
#+-----------+
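As a quick sanity check outside Spark, the same 'MMM-yyyy' → 'MM' conversion can be reproduced with Python's standard datetime module: the strptime directives '%b' and '%Y' correspond to Spark's 'MMM' and 'yyyy' patterns (assuming a default English locale, since '%b' is locale-dependent). This is just an illustration of the format mapping, not part of the Spark pipeline:

```python
from datetime import datetime

def month_number(s: str) -> str:
    # Parse an 'MMM-yyyy' string ('%b-%Y' in strptime terms) and
    # format only the zero-padded month, mirroring Spark's 'MM'.
    return datetime.strptime(s, "%b-%Y").strftime("%m")

print(month_number("Dec-2015"))  # 12
print(month_number("Jun-2015"))  # 06
```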
