Why doesn't from_utc_timestamp throw an error when passed a malformed timezone string in Spark?
When calling the from_utc_timestamp function in Spark 2.4.3, no error is thrown if I pass in a malformed timezone string. Instead, it silently defaults to UTC, which is counter to my expectations and also seems likely to let mistakes go unnoticed. Is this intentional, or is this a bug in Spark?
See the example below:
scala> val df = Seq(("2020-01-01 00:00:00")).toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.show()
+-------------------+
|               date|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
// Not a real timezone obviously. Just gets treated like UTC.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "not_a_real_timezone")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+
// EST5PDT is a typo for EST5EDT, so this is still not a real timezone. It also
// defaults to UTC, which makes this mistake very easy to miss.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5PDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+
// EST5EDT is a real timezone, so this works as expected.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5EDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2019-12-31 19:00:00|
+-------------------+-------------------+
from_utc_timestamp uses DateTimeUtils from org.apache.spark.sql.catalyst.util. To resolve the timezone it calls the getTimeZone method, which generally does not throw.
But looking at other people's code, some projects do guard against this by checking the timezone string first:
import java.util.TimeZone
...
if (!TimeZone.getAvailableIDs().contains(tz)) {
  throw new IllegalStateException(s"The setting '$tz' is not recognized as known time zone")
}
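A minimal, self-contained sketch of that guard, using only the JDK (no Spark required); the helper name requireValidTimeZone is hypothetical:

```scala
import java.util.TimeZone

// Hypothetical helper: fail fast on timezone IDs the JDK does not know,
// instead of letting TimeZone.getTimeZone silently fall back to GMT.
def requireValidTimeZone(tz: String): String = {
  if (!TimeZone.getAvailableIDs().contains(tz)) {
    throw new IllegalStateException(
      s"The setting '$tz' is not recognized as known time zone")
  }
  tz
}

requireValidTimeZone("EST5EDT") // a valid zone ID: passes through unchanged

try {
  requireValidTimeZone("EST5PDT") // the typo from the question: throws
} catch {
  case e: IllegalStateException => println(e.getMessage)
}
```

Running such a check before calling from_utc_timestamp turns the silent UTC fallback into a loud failure.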
EDIT1: Just found out it is a "feature". It is documented in the migration guide for 3.0.0:
In Spark version 2.4 and earlier, invalid time zone ids are silently ignored and replaced by GMT time zone, for example, in the from_utc_timestamp function. Since Spark 3.0, such time zone ids are rejected, and Spark throws java.time.DateTimeException.
https://spark.apache.org/docs/3.0.0-preview/sql-migration-guide.html
from_utc_timestamp calls DateTimeUtils.fromUTCTime:
def fromUTCTime(time: SQLTimestamp, timeZone: String): SQLTimestamp = {
  convertTz(time, TimeZoneGMT, getTimeZone(timeZone))
}
To transform the timezone string into a TimeZone object, the function calls getTimeZone, which in turn calls the JDK's TimeZone.getTimeZone to convert the timezone string into an actual timezone object. The Javadoc of this method states that it returns
the specified TimeZone, or the GMT zone if the given ID cannot be understood
In your case of EST5PDT, no time zone can be found for this string, so the return value is GMT. As a result, the SQLTimestamp is converted from GMT to GMT, which means it stays unchanged.
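This fallback is easy to demonstrate directly against the JDK, without Spark at all:

```scala
import java.util.TimeZone

// A valid zone ID is resolved as expected.
println(TimeZone.getTimeZone("EST5EDT").getID) // EST5EDT

// An unknown ID is NOT rejected; the JDK silently returns the GMT zone,
// which is why from_utc_timestamp behaves as a no-op in Spark 2.4.
println(TimeZone.getTimeZone("EST5PDT").getID) // GMT
```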