
Spark using timestamp inside an RDD

I'm trying to compare timestamps within a map operation, but Spark seems to be using a different time zone or something else that is really odd. I read a dummy CSV file like the following to build the input DataFrame:

"ts"
"1970-01-01 00:00:00"
"1970-01-01 00:00:00"
df.show(2)
+-------------------+
|        ts         |
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:00:00|
+-------------------+
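For reference, a minimal sketch of how such a file might be read (the file name and the explicit schema are assumptions, not taken from the original post; spark is the usual SparkSession):

import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Read the dummy CSV with an explicit schema so that "ts" is a TimestampType column.
val schema = StructType(Seq(StructField("ts", TimestampType)))
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("ts.csv")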

So far, nothing unusual, but then:

import java.sql.Timestamp
import java.time.Instant

df.rdd.map { row =>
  val timestamp = row.getTimestamp(0)                    // java.sql.Timestamp from the "ts" column
  val timestampMilli = timestamp.toInstant.toEpochMilli  // millis since the Unix epoch
  val epoch = Timestamp.from(Instant.EPOCH)              // reference timestamp at the epoch
  val epochMilli = epoch.toInstant.toEpochMilli          // 0 by definition
  (timestamp, timestampMilli, epoch, epochMilli)
}.foreach(println)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)

I don't understand why both timestamps are not 1970-01-01 00:00:00.0, 0. Does anyone know what I'm missing?

NB: I have already set the session timezone to UTC, using the following properties.

spark.sql.session.timeZone=UTC
user.timezone=UTC

The java.sql.Timestamp class inherits from java.util.Date. Both store a UTC-based numeric timestamp but display the time in the local time zone. You would see this with .toString() in Java, and it is the same thing happening with println in the code given.
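A small stand-alone sketch of that behavior (plain JDK classes, no Spark involved; the zone names are only for illustration):

import java.sql.Timestamp
import java.time.Instant
import java.util.TimeZone

// The same instant (0 ms since 1970-01-01T00:00:00Z) prints differently
// depending on the JVM default time zone, while the stored value never changes.
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
println(Timestamp.from(Instant.EPOCH))          // 1970-01-01 00:00:00.0

TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"))
println(Timestamp.from(Instant.EPOCH))          // 1970-01-01 01:00:00.0 (UTC+1 at the epoch)
println(Timestamp.from(Instant.EPOCH).getTime)  // 0 -- only the rendering changed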

I believe your OS (or environment) is set to something like Europe/London. Keep in mind that at the Unix epoch (1970-01-01T00:00:00Z), London was on BST (UTC+1).
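You can check that offset directly with java.time (assuming Java 8+):

import java.time.{Instant, ZoneId}

// Offset of Europe/London at the Unix epoch: +01:00
// (British Standard Time was in force year-round from 1968 to 1971).
println(ZoneId.of("Europe/London").getRules.getOffset(Instant.EPOCH))  // +01:00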

Your timestampMilli variable shows -3600000 because your input was interpreted in local time as 1970-01-01T00:00:00+01:00, which is equivalent to 1969-12-31T23:00:00Z.

Your epoch variable shows 1970-01-01 01:00:00.0 because 0 is equivalent to 1970-01-01T00:00:00Z, which is 1970-01-01T01:00:00+01:00 in that local time zone.
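A short sketch of both conversions, assuming the JVM default zone is Europe/London (UTC+1 at the epoch):

import java.sql.Timestamp
import java.time.Instant

// "1970-01-01 00:00:00" parsed as *local* time -> 1969-12-31T23:00:00Z -> -3600000 ms
println(Timestamp.valueOf("1970-01-01 00:00:00").getTime)  // -3600000

// 0 ms (Instant.EPOCH) rendered in *local* time -> 1970-01-01 01:00:00.0
println(Timestamp.from(Instant.EPOCH))                     // 1970-01-01 01:00:00.0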


I do see you noted that you set your session time zone to UTC, which in theory should work, but the results clearly show it isn't being used here. Sorry, I don't know Spark well enough to tell you why, but I would focus on that part of the problem.
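One possible angle, offered as an assumption rather than something verified in this answer: spark.sql.session.timeZone affects Spark SQL's own timestamp functions, whereas java.sql.Timestamp parsing and printing in plain JVM code follows the JVM default time zone. A hedged sketch of forcing that default to UTC:

import java.util.TimeZone

// Force the driver JVM's default zone to UTC before any Timestamp is parsed or printed.
// On a cluster the executor JVMs would need the same setting, e.g. via the standard
// spark.driver.extraJavaOptions / spark.executor.extraJavaOptions confs with
// -Duser.timezone=UTC (treating this as the fix for the observed output is an assumption).
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))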
