Why doesn't from_utc_timestamp throw an error when passed a malformed timezone string in Spark?
When calling the from_utc_timestamp function in Spark 2.4.3, no error is thrown if I pass in a malformed timezone string. Instead, it silently defaults to UTC, which is counter to my expectations and also seems likely to let mistakes go unnoticed. Is this intentional, or is this a bug in Spark?
See the example below:
scala> val df = Seq(("2020-01-01 00:00:00")).toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.show()
+-------------------+
|               date|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
// Not a real timezone obviously. Just gets treated like UTC.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "not_a_real_timezone")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+
// EST5PDT is a typo for EST5EDT, so this is still not a real timezone. It also
// defaults to UTC, which makes this mistake very easy to miss.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5PDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+
// EST5EDT is a real timezone, so this works as expected.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5EDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2019-12-31 19:00:00|
+-------------------+-------------------+
from_utc_timestamp uses DateTimeUtils from org.apache.spark.sql.catalyst.util. To resolve the timezone it calls the getTimeZone method, which generally does not throw.
But looking at other people's code, some projects do guard against this by checking the timezone string first:
import java.util.TimeZone
...
if (!TimeZone.getAvailableIDs().contains(tz)) {
  throw new IllegalStateException(s"The setting '$tz' is not recognized as known time zone")
}
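A minimal, self-contained sketch of that guard, using only the JDK (no Spark required); the helper name requireValidTimeZone is hypothetical:

```scala
import java.util.TimeZone

// Hypothetical helper: fail fast on timezone IDs the JDK does not know,
// instead of letting TimeZone.getTimeZone silently fall back to GMT.
def requireValidTimeZone(tz: String): String = {
  if (!TimeZone.getAvailableIDs().contains(tz)) {
    throw new IllegalStateException(
      s"The setting '$tz' is not recognized as known time zone")
  }
  tz
}

requireValidTimeZone("EST5EDT") // a valid zone ID: passes through unchanged

try {
  requireValidTimeZone("EST5PDT") // the typo from the question: throws
} catch {
  case e: IllegalStateException => println(e.getMessage)
}
```

Running such a check before calling from_utc_timestamp turns the silent UTC fallback into a loud failure.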
EDIT1: Just found out it is a "feature". It is documented in the migration guide for 3.0.0:
In Spark version 2.4 and earlier, invalid time zone ids are silently ignored and replaced by GMT time zone, for example, in the from_utc_timestamp function. Since Spark 3.0, such time zone ids are rejected, and Spark throws java.time.DateTimeException.
https://spark.apache.org/docs/3.0.0-preview/sql-migration-guide.html
from_utc_timestamp calls DateTimeUtils.fromUTCTime:
def fromUTCTime(time: SQLTimestamp, timeZone: String): SQLTimestamp = {
  convertTz(time, TimeZoneGMT, getTimeZone(timeZone))
}
To transform the timezone string into a TimeZone object, the function calls getTimeZone, which in turn calls the JDK's TimeZone.getTimeZone to convert the timezone string into an actual timezone object. The Javadoc of this method states that it returns
the specified TimeZone, or the GMT zone if the given ID cannot be understood
In your case of EST5PDT, no time zone can be found for this string, so the return value is GMT. As a result, the SQLTimestamp is converted from GMT to GMT, which means it stays unchanged.
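This fallback is easy to demonstrate directly against the JDK, without Spark at all:

```scala
import java.util.TimeZone

// A valid zone ID is resolved as expected.
println(TimeZone.getTimeZone("EST5EDT").getID) // EST5EDT

// An unknown ID is NOT rejected; the JDK silently returns the GMT zone,
// which is why from_utc_timestamp behaves as a no-op in Spark 2.4.
println(TimeZone.getTimeZone("EST5PDT").getID) // GMT
```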