
Why doesn't from_utc_timestamp throw an error when passed a malformed timezone string in Spark?

When calling the from_utc_timestamp function in Spark 2.4.3, no error is thrown if I pass in a malformed timezone string. Instead, it silently defaults to UTC, which is counter to my expectations and also seems likely to let mistakes go unnoticed. Is this intentional, or is this a bug in Spark?

See the example below:

scala> val df =  Seq(("2020-01-01 00:00:00")).toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]

scala> df.show()
+-------------------+
|               date|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

// Not a real timezone obviously. Just gets treated like UTC.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "not_a_real_timezone")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+

// Typo in EST5PDT, so still not a real timezone. Also defaults to UTC, which makes it
// very easy to miss this mistake.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5PDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2020-01-01 00:00:00|
+-------------------+-------------------+

// EST5EDT is a real timezone, so this works as expected.
scala> df.withColumn("est", from_utc_timestamp(col("date"), "EST5EDT")).show()
+-------------------+-------------------+
|               date|                est|
+-------------------+-------------------+
|2020-01-01 00:00:00|2019-12-31 19:00:00|
+-------------------+-------------------+

from_utc_timestamp uses DateTimeUtils from org.apache.spark.sql.catalyst.util . To get the timezone it uses the getTimeZone method, which generally does not throw an error.

  • This might be a JVM thing, where the JVM tries to avoid depending on the system's default locale, charset, and timezone
  • This might be a Spark issue to be logged in Jira

But looking at other people's code, they do set up a check first with:

import java.util.TimeZone

...

if (!TimeZone.getAvailableIDs().contains(tz)) {
  throw new IllegalStateException(s"The setting '$tz' is not recognized as known time zone")
}
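Wrapped as a small helper, that check makes the typo fail fast instead of silently falling back to GMT. A minimal sketch — `requireKnownTimeZone` is our own name, not a Spark API:

```scala
import java.util.TimeZone

// Hypothetical helper, not part of Spark: validate a timezone string
// against the JDK's known IDs before handing it to from_utc_timestamp.
def requireKnownTimeZone(tz: String): String = {
  if (!TimeZone.getAvailableIDs.contains(tz))
    throw new IllegalArgumentException(
      s"The setting '$tz' is not recognized as a known time zone")
  tz
}
```

With this in place, `requireKnownTimeZone("EST5EDT")` passes the string through, while the typo `"EST5PDT"` throws an IllegalArgumentException.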

EDIT1: Just found out it is a "feature". It is in the migration guide for 3.0.0:

In Spark version 2.4 and earlier, invalid time zone ids are silently ignored and replaced by GMT time zone, for example, in the from_utc_timestamp function. Since Spark 3.0, such time zone ids are rejected, and Spark throws java.time.DateTimeException.

https://spark.apache.org/docs/3.0.0-preview/sql-migration-guide.html
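The stricter Spark 3.0 behavior comes from parsing zone ids with the java.time API, and the same rejection can be seen directly with ZoneId.of (plain JDK code, not Spark — shown here as a sketch of the underlying strictness):

```scala
import java.time.{DateTimeException, ZoneId}

// ZoneId.of resolves known ids, but throws a DateTimeException
// (concretely a ZoneRulesException) for unknown ones, rather than
// falling back to GMT the way java.util.TimeZone does.
val valid = ZoneId.of("EST5EDT")

val rejected =
  try { ZoneId.of("EST5PDT"); false }
  catch { case _: DateTimeException => true }
```

So the typo that silently became GMT in Spark 2.4 surfaces as an exception once zone parsing goes through java.time.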

from_utc_timestamp calls DateTimeUtils.fromUTCTime :

def fromUTCTime(time: SQLTimestamp, timeZone: String): SQLTimestamp = {
  convertTz(time, TimeZoneGMT, getTimeZone(timeZone))
}

To transform the timezone string into a TimeZone object, the function calls getTimeZone , and there the JDK's TimeZone.getTimeZone is called to convert the timezone string into an actual TimeZone object. The Javadoc of this method states that it returns

the specified TimeZone, or the GMT zone if the given ID cannot be understood

In your case of EST5PDT , no time zone can be found for this string, so the return value is GMT . As a result, the SQLTimestamp is converted from GMT to GMT, which means it stays unchanged.
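This fallback is easy to verify with the plain JDK API, no Spark needed:

```scala
import java.util.TimeZone

// The JDK never throws here: an unrecognized id silently becomes GMT,
// which is exactly why the EST5PDT typo goes unnoticed in Spark 2.4.
val typo = TimeZone.getTimeZone("EST5PDT").getID // "GMT"
val real = TimeZone.getTimeZone("EST5EDT").getID // "EST5EDT"
```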

