
Why am I not able to convert a string-type column to a date format in PySpark?

I have a column whose values are in the "20130623" format, and I am trying to convert it to dd-mm-YYYY. I have looked at various posts online, including here, but the only solution I found is the one below:

from datetime import datetime
df = df2.withColumn("col_name", datetime.utcfromtimestamp(int("col_name")).strftime('%d-%m-%y'))

However, it throws an error saying that the input should be of int type, not string type. I tried converting it with the int() function, but that doesn't help either.

Below is the error that I see when converting:

invalid literal for int() with base 10: 'col_name'

I am not sure whether it is treating col_name itself as the string, or its value as a string. Please suggest how I can do this, or the best way to get the required output.

Note: I cannot use pandas in my environment.

thank you.

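withColumn() expects a Spark Column expression, so you cannot use Python's datetime functions there. In your code, int("col_name") is evaluated by Python on the literal string 'col_name', not on the column's values, which is why you get that error.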
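
To see what went wrong without Spark in the picture, here is a minimal plain-Python sketch. It assumes only the sample value "20130623" from the question; the variable names are illustrative and not part of the original code.

```python
from datetime import datetime, timezone

raw = "20130623"  # sample value from the question

# int("col_name") receives the literal text 'col_name', not the column's
# values, so Python raises ValueError before Spark is ever involved.
try:
    int("col_name")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Even with a real value, a timestamp conversion is the wrong tool:
# 20130623 is read as seconds since 1970-01-01, not as a calendar date.
as_epoch = datetime.fromtimestamp(int(raw), tz=timezone.utc)

# Parsing by pattern is what the value actually needs.
parsed = datetime.strptime(raw, "%Y%m%d")
formatted = parsed.strftime("%d-%m-%Y")  # %Y is a four-digit year, %y two-digit

print(error_message)
print(as_epoch.year)
print(formatted)
```

Note that Python's pattern letters (%Y%m%d) differ from the Spark patterns (yyyyMMdd), and that this runs on a single Python string, not on a DataFrame column.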

First parse the string into a date using to_date, and then format the date as a string using date_format:

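import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Sample data: one string column holding dates as yyyyMMdd
df = spark.createDataFrame(data=[["20130623"]], schema=["date_str"])

# Step 1: parse the string into a date column
df = df.withColumn("date", F.to_date("date_str", format="yyyyMMdd"))
# Step 2: render the date as a string in the target layout
df = df.withColumn("date_ddMMyyyy", F.date_format("date", format="dd-MM-yyyy"))

df.show(truncate=False)
df.printSchema()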

Output:

+--------+----------+-------------+
|date_str|date      |date_ddMMyyyy|
+--------+----------+-------------+
|20130623|2013-06-23|23-06-2013   |
+--------+----------+-------------+

root
 |-- date_str: string (nullable = true)
 |-- date: date (nullable = true)
 |-- date_ddMMyyyy: string (nullable = true)

The 2013-06-23 you see in the date column is just the default string representation of a date value. The column itself is of type date, as the schema shows, and only date_ddMMyyyy holds a formatted string.
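
The same distinction between a date value and its printed form exists in plain Python. The sketch below is only an analogy using the standard library's datetime.date, not Spark code, and the variable names are illustrative.

```python
from datetime import date

d = date(2013, 6, 23)  # same date as the sample row

# Printing a date uses its default ISO rendering.
default_text = str(d)

# Asking for a specific layout produces a new string, not a date.
custom_text = d.strftime("%d-%m-%Y")

print(default_text)
print(custom_text)
print(type(custom_text).__name__)
```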


 