Convert string with nanoseconds into timestamp in Spark

Is there a way to convert a timestamp value with nanoseconds to a timestamp in Spark? I get the input from a CSV file, and the timestamp value has the format 12-12-2015 14:09:36.992415+01:00. This is the code I tried:

val date_raw_data = List((1, "12-12-2015 14:09:36.992415+01:00"))

val dateraw_df = sc.parallelize(date_raw_data).toDF("ID", "TIMESTAMP_VALUE")

val ts = unix_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.ffffffz").cast("double").cast("timestamp")

val date_df = dateraw_df.withColumn("TIMESTAMP_CONV", ts).show(false)

The output is

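+---+-----------------------+---------------------+
|ID |TIMESTAMP_VALUE        |TIMESTAMP_CONV       |
+---+-----------------------+---------------------+
|1  |12-12-2015 14:09:36.992|null                 |
+---+-----------------------+---------------------+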

I was able to convert a timestamp with milliseconds using the format MM-dd-yyyy HH:mm:ss.SSS. The trouble is with the nanosecond and time zone parts of the format.

unix_timestamp won't do here. Even if you could parse the string (AFAIK SimpleDateFormat doesn't provide the required formats), unix_timestamp has only second precision (emphasis mine):

def unix_timestamp(s: Column, p: String): Column

Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return null if fail.
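To illustrate why SimpleDateFormat is a poor fit for six fractional digits, here is a minimal sketch in plain Scala, without Spark. In SimpleDateFormat the letter S means milliseconds, so 992415 is read as 992415 milliseconds and, with the default lenient calendar, rolled over into the minutes. The object and method names are made up for this example.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object SimpleDateFormatFraction {
  // Parse a wall-clock string as UTC with the given SimpleDateFormat
  // pattern and return epoch milliseconds.
  def parseMillis(pattern: String, value: String): Long = {
    val fmt = new SimpleDateFormat(pattern)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.parse(value).getTime
  }

  // The same instant, once without and once with the fractional digits.
  val base: Long =
    parseMillis("MM-dd-yyyy HH:mm:ss", "12-12-2015 14:09:36")
  val withFraction: Long =
    parseMillis("MM-dd-yyyy HH:mm:ss.SSSSSS", "12-12-2015 14:09:36.992415")

  // The difference is the whole digit run taken as milliseconds,
  // not 992 ms: roughly sixteen and a half minutes of drift.
  val driftMillis: Long = withFraction - base
}
```

So even where such a pattern parses without error, the result is silently wrong rather than more precise.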

You have to create your own function to parse this data. A rough idea:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

def to_nano(c: Column) = {
  val r = "([0-9]{2}-[0-9]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2})(\\.[0-9]*)(.*)$"
  // seconds part: date-time (group 1) joined with the offset (group 3)
  (unix_timestamp(
    concat(
      regexp_extract(c, r, 1),
      regexp_extract(c, r, 3)
    ), "MM-dd-yyyy HH:mm:ssXXX"  // yyyy (calendar year), not YYYY (week year)
  ).cast("decimal(38, 9)") +
  // subsecond part: ".992415" cast to a decimal fraction
  regexp_extract(c, r, 2).cast("decimal(38, 9)")).alias("value")
}

Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(to_nano($"TIMESTAMP_VALUE").cast("timestamp"))
  .show(false)

// With a session time zone of UTC+01:00:
// +--------------------------+
// |value                     |
// +--------------------------+
// |2015-12-12 14:09:36.992415|
// +--------------------------+
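As an alternative to the regex arithmetic, the parsing itself can be sketched with java.time, where S means fraction-of-second rather than milliseconds. This is not part of the answer above, only an illustration of the kind of function one might wrap in a UDF; the object name and the Option return type are choices made for this example, and it assumes the input always carries exactly six fractional digits and an offset such as +01:00.

```scala
import java.sql.Timestamp
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object MicrosTimestampParser {
  // Six S letters match exactly six fractional digits (microseconds);
  // XXX matches an offset written like +01:00.
  private val fmt =
    DateTimeFormatter.ofPattern("MM-dd-yyyy HH:mm:ss.SSSSSSXXX")

  // Returns None instead of throwing when the string does not match,
  // similar in spirit to unix_timestamp returning null on failure.
  def parse(s: String): Option[Timestamp] =
    Try(Timestamp.from(OffsetDateTime.parse(s, fmt).toInstant)).toOption
}
```

In a Spark job this could presumably be registered with org.apache.spark.sql.functions.udf and applied to the TIMESTAMP_VALUE column; that wiring is left out here so the sketch runs without a Spark session.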

Here comes a dirty, dirty trick that makes this work without a UDF, if you don't care about nanoseconds. (I cannot use a UDF where this is required, and I cannot modify the source.)

select CAST(UNIX_TIMESTAMP(substr(date,0,length(date)-4), "yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP);


select CAST(UNIX_TIMESTAMP(substr("2020-09-14T01:14:15.596444Z",0,length("2020-09-14T01:14:15.596444Z")-4), "yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP);

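I'm basically stripping the nanoseconds part off the string and parsing the rest with Spark's SimpleDateFormat-compatible parser. Note that UNIX_TIMESTAMP returns whole seconds, so the milliseconds are parsed but then dropped, and since the trailing Z is stripped too, the value is interpreted in the session time zone.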
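To make it clearer what the parser actually sees after the substr call, here is the same string manipulation as a small Scala sketch outside Spark. It assumes the fixed layout from the example (exactly six fractional digits followed by Z); the object and method names are invented for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object TruncateToMillis {
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")

  // Drop the last four characters: three sub-millisecond digits plus "Z".
  // This mirrors substr(date, 0, length(date) - 4) in the SQL.
  def strip(s: String): String = s.substring(0, s.length - 4)

  // Parse what is left as a local date-time with millisecond precision.
  def parse(s: String): LocalDateTime = LocalDateTime.parse(strip(s), fmt)
}
```

A string with a different number of fractional digits, or with an offset like +01:00 instead of Z, would be cut in the wrong place, so the trick only holds while the input format stays fixed.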

Please future employer, don't judge me by this reply.

