
How to convert multiple columns, i.e. year, month, date and time (hhmm), into datetime format in a PySpark dataframe

The dataframe has 4 columns: Year, month, date, hhmm.

hhmm is the hour and minute concatenated, e.g. 10:30 is stored as 1030.

dd=spark.createDataFrame([(2019,2,13,1030),(2018,2,14,1000),(2029,12,13,0300)],["Year","month","date","hhmm"])
dd.collect()

Expected output, in datetime format, in the PySpark dataframe dd:

dd.collect()
2019-02-13 10:30:00
2018-02-14 10:00:00
2029-12-13 03:00:00

There is a problem with your data: the integer 0300 will not load in the desired form. For me it loaded as 192, because a leading zero makes 0300 an octal literal in Python 2 (0o300 == 192); in Python 3 the bare leading zero is a syntax error outright. So first you have to load it as a string: just assign the data types using a schema when doing the load. Refer to the documentation. E.g. for a .csv:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Year", StringType(), True),
    StructField("month", StringType(), True),
    StructField("date", StringType(), True),
    StructField("hhmm", StringType(), True),
])

# csv() is an instance method of DataFrameReader, reached via spark.read
dd = spark.read.csv(path='your/data/path', schema=schema)
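For reference, here is a quick way to see why 0300 loaded as 192; this assumes the original snippet ran under Python 2, where a leading zero marks an octal literal:

# Python 2: 0300 is octal, i.e. 3*64 + 0*8 + 0 = 192.
# Python 3 rejects the bare leading zero (SyntaxError) and spells it 0o300.
print(int('300', 8))  # 192 in both Python 2 and 3
print(0o300)          # 192, the modern octal spelling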

After that, you need to fix the data format (zero-pad the short values) and convert it to a timestamp:

from pyspark.sql import functions as F

dd = spark.createDataFrame([('2019','2','13','1030'),('2018','2','14','1000'),('2029','12','13','300')],["Year","month","date","hhmm"])

# Zero-pad month and date to two digits and hhmm to four digits,
# then concatenate the four columns and parse them as a timestamp.
dd = (dd.withColumn('month', F.when(F.length(F.col('month')) == 1, F.concat(F.lit('0'), F.col('month'))).otherwise(F.col('month')))
        .withColumn('date', F.when(F.length(F.col('date')) == 1, F.concat(F.lit('0'), F.col('date'))).otherwise(F.col('date')))
        .withColumn('hhmm', F.when(F.length(F.col('hhmm')) == 1, F.concat(F.lit('000'), F.col('hhmm')))
                             .when(F.length(F.col('hhmm')) == 2, F.concat(F.lit('00'), F.col('hhmm')))
                             .when(F.length(F.col('hhmm')) == 3, F.concat(F.lit('0'), F.col('hhmm')))
                             .otherwise(F.col('hhmm')))
        .withColumn('time', F.to_timestamp(F.concat(*dd.columns), format='yyyyMMddHHmm'))
     )

dd.show()

+----+-----+----+----+-------------------+
|Year|month|date|hhmm|               time|
+----+-----+----+----+-------------------+
|2019|   02|  13|1030|2019-02-13 10:30:00|
|2018|   02|  14|1000|2018-02-14 10:00:00|
|2029|   12|  13|0300|2029-12-13 03:00:00|
+----+-----+----+----+-------------------+
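As an aside (not part of the original answer), the zero-padding can be written more compactly with F.lpad; a minimal sketch that produces the same result, assuming Year is always four digits:

from pyspark.sql import functions as F

dd = (dd.withColumn('month', F.lpad('month', 2, '0'))  # '2'   -> '02'
        .withColumn('date', F.lpad('date', 2, '0'))    # '3'   -> '03'
        .withColumn('hhmm', F.lpad('hhmm', 4, '0'))    # '300' -> '0300'
        .withColumn('time', F.to_timestamp(F.concat('Year', 'month', 'date', 'hhmm'), 'yyyyMMddHHmm'))
     )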

For Spark 3+, you can use the make_timestamp function:

from pyspark.sql import functions as F

dd = dd.withColumn(
    "time",
    F.expr("make_timestamp(Year, month, date, substr(hhmm,1,2), substr(hhmm,3,2), 0)")
)

dd.show(truncate=False)

#+----+-----+----+----+-------------------+
#|Year|month|date|hhmm|time               |
#+----+-----+----+----+-------------------+
#|2019|2    |13  |1030|2019-02-13 10:30:00|
#|2018|2    |14  |1000|2018-02-14 10:00:00|
#|2029|12   |13  |0300|2029-12-13 03:00:00|
#+----+-----+----+----+-------------------+
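Note that substr(hhmm, 1, 2) assumes hhmm is already four characters long; if it can be shorter (e.g. '300' loaded from the raw data), pad it inside the expression first. A sketch of that variant:

from pyspark.sql import functions as F

dd = dd.withColumn(
    "time",
    F.expr("make_timestamp(Year, month, date, substr(lpad(hhmm, 4, '0'), 1, 2), substr(lpad(hhmm, 4, '0'), 3, 2), 0)")
)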
