
Convert string with nanoseconds into timestamp in Spark

Is there a way to convert a timestamp value with nanoseconds to a timestamp in Spark? I get the input from a CSV file, and the timestamp value is of the format 12-12-2015 14:09:36.992415+01:00 . This is the code I tried:

val date_raw_data = List((1, "12-12-2015 14:09:36.992415+01:00"))

val dateraw_df = sc.parallelize(date_raw_data).toDF("ID", "TIMESTAMP_VALUE")

val ts = unix_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.ffffffz").cast("double").cast("timestamp")

val date_df = dateraw_df.withColumn("TIMESTAMP_CONV", ts).show(false)

The output is:

+---+-----------------------+---------------------+
|ID |TIMESTAMP_VALUE        |TIMESTAMP_CONV       |
+---+-----------------------+---------------------+
|1  |12-12-2015 14:09:36.992|null                 |
+---+-----------------------+---------------------+

I was able to convert a timestamp with milliseconds using the format MM-dd-yyyy HH:mm:ss.SSS . The trouble is with the nanosecond and timezone parts.
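
For reference, this is the millisecond variant that worked (a minimal sketch, assuming spark.implicits._ is in scope):

val ms_df = Seq((1, "12-12-2015 14:09:36.992")).toDF("ID", "TIMESTAMP_VALUE")
// parses without returning null, since .SSS is a valid SimpleDateFormat pattern
ms_df.withColumn(
  "TIMESTAMP_CONV",
  unix_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.SSS").cast("timestamp")
).show(false)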

unix_timestamp won't do here. Even if you could parse the string (AFAIK SimpleDateFormat doesn't provide the required format letters), unix_timestamp has only second precision (emphasis mine):

def unix_timestamp(s: Column, p: String): Column

Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
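
A quick check of that limit (a sketch, assuming spark.implicits._ and org.apache.spark.sql.functions._ are in scope):

// even with .SSS in the pattern, unix_timestamp returns whole seconds,
// so the .992 is discarded before the cast back to timestamp
Seq("12-12-2015 14:09:36.992").toDF("v")
  .select(unix_timestamp($"v", "MM-dd-yyyy HH:mm:ss.SSS").cast("timestamp").alias("ts"))
  .show(false)
// ts: 2015-12-12 14:09:36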

You have to create your own function to parse this data. A rough idea:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

def to_nano(c: Column) = {
  val r = "([0-9]{2}-[0-9]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2})(\\.[0-9]*)(.*)$"
  // seconds part: date-time concatenated with the zone offset, parsed to Unix seconds
  (unix_timestamp(
    concat(
      regexp_extract(c, r, 1),
      regexp_extract(c, r, 3)
    ), "MM-dd-yyyy HH:mm:ssXXX"  // yyyy, not YYYY (week-year), which shifts dates near New Year
  ).cast("decimal(38, 9)") + 
  // subsecond part: the fractional digits, added back as a decimal
  regexp_extract(c, r, 2).cast("decimal(38, 9)")).alias("value")
}

Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(to_nano($"TIMESTAMP_VALUE").cast("timestamp"))
  .show(false)

// +--------------------------+
// |value                     |
// +--------------------------+
// |2015-12-12 14:09:36.992415|
// +--------------------------+
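
As an aside: on Spark 3.x (an assumption about your environment), the java.time-based datetime patterns support fractional seconds and zone offsets, so to_timestamp should parse this directly, keeping microsecond precision (Spark timestamps don't store nanoseconds anyway):

// a sketch for Spark 3.x only; SSSSSS matches the six fractional digits,
// XXX matches the +01:00 offset
Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(to_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.SSSSSSXXX").alias("value"))
  .show(false)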

Here comes a dirty, dirty trick without a UDF to make this work, if you don't care about nanoseconds. (I cannot use a UDF where this is required, and I cannot modify the source.)

select CAST(UNIX_TIMESTAMP(substr(date,0,length(date)-4), "yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP);

E.g.:

select CAST(UNIX_TIMESTAMP(substr("2020-09-14T01:14:15.596444Z",0,length("2020-09-14T01:14:15.596444Z")-4), "yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP);

I'm basically stripping the sub-millisecond digits (the last four characters, "444Z" in the example) off the string and parsing the rest with Spark's SimpleDateFormat-compatible parser.
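
The same trick as a DataFrame expression, in case that is more convenient (a sketch; df and date are placeholder names, and org.apache.spark.sql.functions._ is assumed to be imported):

// drop the last 4 characters so only .SSS remains,
// then parse with the SimpleDateFormat-compatible pattern
df.select(
  unix_timestamp(
    expr("substr(date, 0, length(date) - 4)"),
    "yyyy-MM-dd'T'HH:mm:ss.SSS"
  ).cast("timestamp").alias("ts")
)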

Please, future employer, don't judge me by this reply.
