I have a data frame with 4 columns: Year, month, date, hhmm.
hhmm is the hour and minute concatenated, e.g. 10:30 is stored as 1030.
dd=spark.createDataFrame([(2019,2,13,1030),(2018,2,14,1000),(2029,12,13,0300)],["Year","month","date","hhmm"])
dd.collect()
Expected output, as datetime values, in the PySpark DataFrame dd:
2019-02-13 10:30:00
2018-02-14 10:00:00
2029-12-13 03:00:00
There is a problem with your data: the integer literal 0300 will not load in the desired format. For me it loaded as 192 (a leading zero marks an octal literal in Python 2), so you first have to load it as a string by assigning the data types with a schema at load time. Refer to the documentation. E.g. for a .csv:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Year", StringType(), True),
    StructField("month", StringType(), True),
    StructField("date", StringType(), True),
    StructField("hhmm", StringType(), True),
])
dd = spark.read.csv(path='your/data/path', schema=schema)
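The 192 mentioned above comes from Python 2's octal integer literals; a quick standalone check:

```python
# In Python 2, a leading zero made an integer literal octal: 0300 == 3*64 == 192.
print(int("300", 8))  # → 192

# In Python 3, the same literal is rejected outright:
try:
    compile("x = 0300", "<demo>", "exec")
except SyntaxError:
    print("0300 is a SyntaxError in Python 3")
```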
After that, you need to zero-pad the values and convert them to a timestamp:
from pyspark.sql import functions as F
dd = spark.createDataFrame([('2019','2','13','1030'),('2018','2','14','1000'),('2029','12','13','300')],["Year","month","date","hhmm"])
dd = (dd
    .withColumn('month', F.when(F.length(F.col('month')) == 1,
                                F.concat(F.lit('0'), F.col('month'))).otherwise(F.col('month')))
    .withColumn('date', F.when(F.length(F.col('date')) == 1,
                               F.concat(F.lit('0'), F.col('date'))).otherwise(F.col('date')))
    .withColumn('hhmm', F.when(F.length(F.col('hhmm')) == 1, F.concat(F.lit('000'), F.col('hhmm')))
                         .when(F.length(F.col('hhmm')) == 2, F.concat(F.lit('00'), F.col('hhmm')))
                         .when(F.length(F.col('hhmm')) == 3, F.concat(F.lit('0'), F.col('hhmm')))
                         .otherwise(F.col('hhmm')))
    .withColumn('time', F.to_timestamp(F.concat(*dd.columns), format='yyyyMMddHHmm'))
)
dd.show()
+----+-----+----+----+-------------------+
|Year|month|date|hhmm| time|
+----+-----+----+----+-------------------+
|2019| 02| 13|1030|2019-02-13 10:30:00|
|2018| 02| 14|1000|2018-02-14 10:00:00|
|2029| 12| 13|0300|2029-12-13 03:00:00|
+----+-----+----+----+-------------------+
For Spark 3+, you can use the make_timestamp function:
from pyspark.sql import functions as F
dd = dd.withColumn(
    "time",
    F.expr("""make_timestamp(Year, month, date,
                             substr(lpad(hhmm, 4, '0'), 1, 2),
                             substr(lpad(hhmm, 4, '0'), 3, 2), 0)""")
)
dd.show(truncate=False)
#+----+-----+----+----+-------------------+
#|Year|month|date|hhmm|time               |
#+----+-----+----+----+-------------------+
#|2019|2    |13  |1030|2019-02-13 10:30:00|
#|2018|2    |14  |1000|2018-02-14 10:00:00|
#|2029|12   |13  |300 |2029-12-13 03:00:00|
#+----+-----+----+----+-------------------+
Note the lpad inside the expression: without it, an unpadded value like 300 would be split as hour 30 and minute 0, which is invalid.