
How to convert multiple columns, i.e. year, month, date and time (hhmm), into datetime format in a PySpark dataframe

The dataframe has 4 columns: Year, month, date, hhmm.

hhmm is the hour and minute concatenated, e.g. 10:30 is stored as 1030.

dd=spark.createDataFrame([(2019,2,13,1030),(2018,2,14,1000),(2029,12,13,0300)],["Year","month","date","hhmm"])
dd.collect()

Expected output, in datetime format, in the PySpark dataframe dd:

dd.collect()
2019-02-13 10:30:00
2018-02-14 10:00:00
2029-12-13 03:00:00

There is a problem with your data: the integer 0300 will not load in the desired form. For me it loaded as 192, because a leading zero makes 0300 an octal literal in Python 2 (0o300 == 192); in Python 3 the bare leading zero is a syntax error outright. So first you have to load it as a string: just assign the data types using a schema when doing the load. Refer to the documentation. E.g. for a .csv:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Year", StringType(), True),
    StructField("month", StringType(), True),
    StructField("date", StringType(), True),
    StructField("hhmm", StringType(), True),
])

# csv() is an instance method of DataFrameReader, reached via spark.read
dd = spark.read.csv(path='your/data/path', schema=schema)
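For reference, here is a quick way to see why 0300 loaded as 192; this assumes the original snippet ran under Python 2, where a leading zero marks an octal literal:

# Python 2: 0300 is octal, i.e. 3*64 + 0*8 + 0 = 192.
# Python 3 rejects the bare leading zero (SyntaxError) and spells it 0o300.
print(int('300', 8))  # 192 in both Python 2 and 3
print(0o300)          # 192, the modern octal spelling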

After that, you need to fix the data format (zero-pad the short values) and convert it to a timestamp:

from pyspark.sql import functions as F

dd = spark.createDataFrame([('2019','2','13','1030'),('2018','2','14','1000'),('2029','12','13','300')],["Year","month","date","hhmm"])

# Zero-pad month and date to two digits and hhmm to four digits,
# then concatenate the four columns and parse them as a timestamp.
dd = (dd.withColumn('month', F.when(F.length(F.col('month')) == 1, F.concat(F.lit('0'), F.col('month'))).otherwise(F.col('month')))
        .withColumn('date', F.when(F.length(F.col('date')) == 1, F.concat(F.lit('0'), F.col('date'))).otherwise(F.col('date')))
        .withColumn('hhmm', F.when(F.length(F.col('hhmm')) == 1, F.concat(F.lit('000'), F.col('hhmm')))
                             .when(F.length(F.col('hhmm')) == 2, F.concat(F.lit('00'), F.col('hhmm')))
                             .when(F.length(F.col('hhmm')) == 3, F.concat(F.lit('0'), F.col('hhmm')))
                             .otherwise(F.col('hhmm')))
        .withColumn('time', F.to_timestamp(F.concat(*dd.columns), format='yyyyMMddHHmm'))
     )

dd.show()

+----+-----+----+----+-------------------+
|Year|month|date|hhmm|               time|
+----+-----+----+----+-------------------+
|2019|   02|  13|1030|2019-02-13 10:30:00|
|2018|   02|  14|1000|2018-02-14 10:00:00|
|2029|   12|  13|0300|2029-12-13 03:00:00|
+----+-----+----+----+-------------------+
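As an aside (not part of the original answer), the zero-padding can be written more compactly with F.lpad; a minimal sketch that produces the same result, assuming Year is always four digits:

from pyspark.sql import functions as F

dd = (dd.withColumn('month', F.lpad('month', 2, '0'))  # '2'   -> '02'
        .withColumn('date', F.lpad('date', 2, '0'))    # '3'   -> '03'
        .withColumn('hhmm', F.lpad('hhmm', 4, '0'))    # '300' -> '0300'
        .withColumn('time', F.to_timestamp(F.concat('Year', 'month', 'date', 'hhmm'), 'yyyyMMddHHmm'))
     )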

For Spark 3+, you can use the make_timestamp function:

from pyspark.sql import functions as F

dd = dd.withColumn(
    "time",
    F.expr("make_timestamp(Year, month, date, substr(hhmm,1,2), substr(hhmm,3,2), 0)")
)

dd.show(truncate=False)

#+----+-----+----+----+-------------------+
#|Year|month|date|hhmm|time               |
#+----+-----+----+----+-------------------+
#|2019|2    |13  |1030|2019-02-13 10:30:00|
#|2018|2    |14  |1000|2018-02-14 10:00:00|
#|2029|12   |13  |0300|2029-12-13 03:00:00|
#+----+-----+----+----+-------------------+
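Note that substr(hhmm, 1, 2) assumes hhmm is already four characters long; if it can be shorter (e.g. '300' loaded from the raw data), pad it inside the expression first. A sketch of that variant:

from pyspark.sql import functions as F

dd = dd.withColumn(
    "time",
    F.expr("make_timestamp(Year, month, date, substr(lpad(hhmm, 4, '0'), 1, 2), substr(lpad(hhmm, 4, '0'), 3, 2), 0)")
)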
