Is there a way to convert a timestamp value with nanoseconds to a timestamp in Spark? I get the input from a CSV file, and the timestamp values have the format 12-12-2015 14:09:36.992415+01:00. This is the code I tried.
val date_raw_data = List((1, "12-12-2015 14:09:36.992415+01:00"))
val dateraw_df = sc.parallelize(date_raw_data).toDF("ID", "TIMESTAMP_VALUE")
val ts = unix_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.ffffffz").cast("double").cast("timestamp")
val date_df = dateraw_df.withColumn("TIMESTAMP_CONV", ts).show(false)
The output is
+---+-----------------------+---------------------+
|ID |TIMESTAMP_VALUE |TIMESTAMP_CONV |
+---+-----------------------+---------------------+
|1 |12-12-2015 14:09:36.992|null |
+---+-----------------------+---------------------+
I was able to convert a timestamp with milliseconds using the format MM-dd-yyyy HH:mm:ss.SSS. The trouble is with the nanosecond and time zone parts.
unix_timestamp won't do here. Even if you could parse the string (AFAIK SimpleDateFormat doesn't provide required formats), unix_timestamp has only second precision (emphasis mine):
def unix_timestamp(s: Column, p: String): Column
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
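To illustrate the SimpleDateFormat limitation mentioned above, here is a small sketch in plain Scala, with no Spark involved. It assumes the parser behaves like java.text.SimpleDateFormat, which treats the S field as a count of milliseconds, so six fractional digits are read as 992415 ms rather than 0.992415 s. The object and method names are made up for the illustration.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object SimpleDateFormatPitfall {
  // Hypothetical helper: parse with a given pattern, return epoch milliseconds.
  def parseMillis(s: String, pattern: String): Long = {
    val fmt = new SimpleDateFormat(pattern)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC")) // fixed zone so results are reproducible
    fmt.parse(s).getTime
  }

  val withoutFraction: Long =
    parseMillis("12-12-2015 14:09:36", "MM-dd-yyyy HH:mm:ss")

  // "992415" is interpreted as 992415 milliseconds (about 16.5 minutes),
  // so the parsed instant is shifted instead of carrying microseconds.
  val withFraction: Long =
    parseMillis("12-12-2015 14:09:36.992415", "MM-dd-yyyy HH:mm:ss.SSSSSS")

  val shiftMillis: Long = withFraction - withoutFraction
}
```

So even when such a pattern parses without error, the result is wrong, which is one more reason to handle the fractional part separately.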
You have to create your own function to parse this data. A rough idea:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
def to_nano(c: Column) = {
  val r = "([0-9]{2}-[0-9]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2})(\\.[0-9]*)(.*)$"
  // seconds part: date-time (group 1) plus offset (group 3), parsed to whole seconds
  (unix_timestamp(
    concat(
      regexp_extract(c, r, 1),
      regexp_extract(c, r, 3)
    ), "MM-dd-yyyy HH:mm:ssXXX"
  ).cast("decimal(38, 9)") +
  // subsecond part: fractional digits (group 2), e.g. ".992415"
  regexp_extract(c, r, 2).cast("decimal(38, 9)")).alias("value")
}
Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(to_nano($"TIMESTAMP_VALUE").cast("timestamp"))
  .show(false)
// Output as displayed with a session time zone of UTC+01:00:
// +--------------------------+
// |value                     |
// +--------------------------+
// |2015-12-12 14:09:36.992415|
// +--------------------------+
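As an alternative sketch (not part of the answer above), the same string can also be parsed on the JVM with java.time, which understands six fractional digits and a +01:00 style offset directly. The pattern is assumed from the single sample value in the question, and the names MicroTimestamp and parseMicros are hypothetical.

```scala
import java.sql.Timestamp
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

object MicroTimestamp {
  // Assumed layout: month-day-year, six fractional digits, ISO offset such as +01:00.
  private val fmt = DateTimeFormatter.ofPattern("MM-dd-yyyy HH:mm:ss.SSSSSSXXX")

  // Parse to an instant and keep the sub-second part in the Timestamp's nanos field.
  def parseMicros(s: String): Timestamp =
    Timestamp.from(OffsetDateTime.parse(s, fmt).toInstant)
}
```

A function like this could presumably be wrapped with org.apache.spark.sql.functions.udf and applied to the column, at the usual cost of a UDF compared with built-in expressions. Note that it throws on malformed input rather than returning null, so real data would need some error handling around it.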
Here comes a dirty, dirty trick without a UDF that makes this work if you don't care about nanoseconds. (I cannot use a UDF where this is required, and I cannot modify the source.)
select CAST(UNIX_TIMESTAMP(substr(date,0,length(date)-4), "yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP);
For example:
select CAST(UNIX_TIMESTAMP(substr("2020-09-14T01:14:15.596444Z",0,length("2020-09-14T01:14:15.596444Z")-4), "yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP);
I'm basically stripping the last four characters (the sub-millisecond digits and the trailing Z) off the string and parsing the rest with Spark's SimpleDateFormat-compatible parser.
Please future employer, don't judge me by this reply.
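For anyone who wants to see what the substring trick does to the value, here is the same idea as a small plain-Scala sketch. It assumes exactly six fractional digits followed by a single-character zone designator, as in the example string. The object and method names are made up.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object TruncateFraction {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")

  // Drop the last four characters: three sub-millisecond digits plus the trailing 'Z'.
  def truncate(s: String): String = s.dropRight(4)

  // Parse what is left; the zone designator is simply discarded here.
  def parse(s: String): LocalDateTime = LocalDateTime.parse(truncate(s), fmt)
}
```

With "2020-09-14T01:14:15.596444Z" as input, truncate leaves "2020-09-14T01:14:15.596", which is what the millisecond pattern expects. A string with a different number of fractional digits or a longer offset such as +01:00 would be cut in the wrong place.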