Timestamp data value different between Hive tables and Databricks Delta tables
We made a binary copy of the data from Hive to ADLS and validated the checksums. Values match for every other data type, but timestamp columns show different values between the Hive and Delta (Azure Databricks) tables.
select abcdtstmp from xyz.abc where mn_ID = "sdsdsd-7878-0016"
2018-01-16 00:00:00.0 (on prem)
select abcdtstmp from xyz.abc where mn_ID = "sdsdsd-7878-0016"
2018-01-16T05:00:00.000+0000(DBX)
The checksums and all other validations match, but the extra values added after the 'T' are a concern. Any suggestions would be helpful.
This seems to be related to the timezone and Hive.
Hive always assumes that timestamps in Parquet files are stored in UTC, and it converts them to the local system time (the cluster host's time) on output. So even if you are transferring data from EST to EST, Hive is the culprit.
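To make the two query results above concrete, here is a minimal Python sketch showing that they can be the same instant rendered in two timezones. It assumes the on-prem cluster runs on US Eastern time (UTC-5 in January); the variable names are illustrative and not part of Hive or Databricks.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the on-prem cluster clock is US Eastern, which is UTC-5 in January.
EST = timezone(timedelta(hours=-5), "EST")

# The instant as Databricks displays it (UTC).
stored = datetime(2018, 1, 16, 5, 0, 0, tzinfo=timezone.utc)

# The same instant rendered on the cluster's local clock, as Hive does on output.
hive_view = stored.astimezone(EST)

print(stored.isoformat())                       # 2018-01-16T05:00:00+00:00
print(hive_view.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-01-16 00:00:00
print(stored == hive_view)                      # True: one instant, two renderings
```

Under that assumption the stored value is identical on both sides, which would be consistent with the checksums matching.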
If your Hive version is higher than 1.2, you can follow this link and set the property below: https://issues.apache.org/jira/browse/HIVE-9482
set hive.parquet.timestamp.skip.conversion=true
Otherwise, you need to manually convert the data back to EST, or whatever timezone you want, using the SQL below.
from_utc_timestamp(to_utc_timestamp(my_dt_tm,'America/New_York'),'America/Denver') AS local_time
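As a rough illustration of what the expression above computes, the following Python sketch emulates the two SQL functions. The helpers are hypothetical stand-ins that only share the SQL functions' names, and they use fixed winter offsets (EST = UTC-5, MST = UTC-7) as a simplifying assumption; the real functions take zone names and handle daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed winter offsets; the real SQL functions take zone names and handle DST.
NEW_YORK = timezone(timedelta(hours=-5))  # EST
DENVER = timezone(timedelta(hours=-7))    # MST

def to_utc_timestamp(ts, tz):
    """Hypothetical stand-in: treat naive ts as wall-clock time in tz, return naive UTC."""
    return ts.replace(tzinfo=tz).astimezone(timezone.utc).replace(tzinfo=None)

def from_utc_timestamp(ts, tz):
    """Hypothetical stand-in: treat naive ts as UTC, return naive wall-clock time in tz."""
    return ts.replace(tzinfo=timezone.utc).astimezone(tz).replace(tzinfo=None)

my_dt_tm = datetime(2018, 1, 16, 0, 0, 0)
as_utc = to_utc_timestamp(my_dt_tm, NEW_YORK)    # New York wall clock -> UTC
local_time = from_utc_timestamp(as_utc, DENVER)  # UTC -> Denver wall clock

print(as_utc)      # 2018-01-16 05:00:00
print(local_time)  # 2018-01-15 22:00:00
```

The inner call moves a New York wall-clock value to UTC and the outer call moves that UTC value to Denver time, so swapping the zone names retargets the conversion.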