
Convert date-time timestamp in Spark DataFrame to epoch timestamp

I have a parquet file with a timestamp column in this format, 2020-07-07 18:30:14.500000+00:00, written from pandas. When I read the same parquet file in Spark, it is read as 2020-07-08 00:00:14.5.

I want to convert this into an epoch timestamp in milliseconds, which is 1594146614500.

I have tried using the Java date-time formatting API:

val dtformat = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
dtformat.parse(r2.getAs[Long]("date_time").toString).getTime

It converts, but to the wrong value (1594146614005 instead of 1594146614500): the SSS field reads the fraction .5 as 5 milliseconds rather than 500.

To make it correct I have to pad the fraction: dtformat.parse(r2.getAs[Long]("date_time").toString + "00").getTime. Is there any cleaner approach than this?

Is there any function available in Spark to read it as milliseconds?
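
For reference, a minimal sketch of the SSS pitfall above, with the time zone pinned to UTC so the epoch values match the ones quoted here:

// Sketch: SimpleDateFormat's SSS field treats the digits after the dot as a
// plain millisecond count, so ".5" parses as 5 ms, not 500 ms.
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
println(fmt.parse("2020-07-07 18:30:14.5").getTime)   // 1594146614005 (wrong)
println(fmt.parse("2020-07-07 18:30:14.500").getTime) // 1594146614500 (right)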

Update 1:

After applying the answer below:

df.withColumn("timestamp", to_timestamp($"date_time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show()

+-------------+--------------------+-------------------+-------------+
|expected_time|       original_time|          timestamp|        epoch|
+-------------+--------------------+-------------------+-------------+
|1597763904500|2020-08-18 20:48:...|2020-08-18 20:48:24|1597763904000|
|1597763905000| 2020-08-18 20:48:25|2020-08-18 20:48:25|1597763905000|
|1597763905500|2020-08-18 20:48:...|2020-08-18 20:48:25|1597763905000|
+-------------+--------------------+-------------------+-------------+

The drawback is that if the data is at 500 ms granularity, the fractional part is lost in the parse and two distinct timestamps collapse to the same epoch value, which is not expected.
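
A way around this (a sketch, assuming the parquet column is already read as a Spark TimestampType named date_time, as the question suggests) is to skip the string re-parse entirely and cast the timestamp itself, which keeps the sub-second part:

import org.apache.spark.sql.functions._

// Casting a timestamp to a decimal yields fractional seconds since the epoch,
// so multiplying by 1000 preserves the 500 ms granularity.
val withEpoch = df.withColumn(
  "epoch",
  (col("date_time").cast("decimal(20, 6)") * 1000).cast("bigint")
)
// On Spark 3.1+, the built-in unix_millis(col("date_time")) is equivalent.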

I recommend you switch from the outdated, error-prone date/time API in java.util and the corresponding formatting API (java.text.SimpleDateFormat) to the modern date/time API in java.time and its formatting API (java.time.format). Learn more about the modern date-time API from Trail: Date Time.

import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class Main {
    public static void main(String[] args) {
        // SSSSSS parses all six fractional-second digits and ZZZZZ parses the
        // +00:00 offset, so no default time zone is involved in the result.
        OffsetDateTime odt = OffsetDateTime.parse("2020-07-07 18:30:14.500000+00:00",
                DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSZZZZZ"));
        System.out.println(odt.toInstant().toEpochMilli());
    }
}

Output:

1594146614500
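
Since the question's code is Scala, the same java.time parse works there verbatim (a sketch using the identical pattern, e.g. inside a map over rows):

import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// Same pattern as the Java example; toInstant.toEpochMilli gives the millis.
val fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSZZZZZ")
val millis = OffsetDateTime.parse("2020-07-07 18:30:14.500000+00:00", fmt)
  .toInstant
  .toEpochMilli
// millis: Long = 1594146614500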

With the Spark DataFrame functions:

df.withColumn("timestamp", to_timestamp($"time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show(false)

+--------------------------------+---------------------+-------------+
|time                            |timestamp            |epoch        |
+--------------------------------+---------------------+-------------+
|2020-07-07 18:30:14.500000+00:00|2020-07-07 18:30:14.5|1594146614500|
+--------------------------------+---------------------+-------------+

This is also a possible way to do it.
