
Spark SQL to Hive table - Datetime Field Hours Bug

I face this problem: when I insert data into a timestamp field in Hive with spark.sql, the hours are strangely changed to 21:00:00!

Let me explain:

I have a CSV file that I read with spark.sql. I read the file, convert it to a dataframe, and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The Hive field I want to load it into is of Timestamp type (the reason I use this data type instead of Date is that I want to query the table with Impala, and Impala only has a Timestamp data type, so simply changing the column type to Date is not a solution).

As you can see from the documentation, the Hive Timestamp data type uses the "YYYY-MM-DD HH:MM:SS" format, so before writing the dataframe to the Hive table I convert the date values to that format.

Here is my code in Python:

from datetime import datetime
from pyspark.sql.functions import udf

df = spark.read.csv("hdfs:/user/../MyFile.csv", header=True)

#Use a user defined function to convert date format
def DateConvert(x):
    x_augm = str(x)+" 00:00:00"
    datetime_object = datetime.strptime(x_augm,'%d/%m/%Y %H:%M:%S')
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S')

DateConvert_udf = udf(DateConvert)

df = df.withColumn("Trans_Date", DateConvert_udf("Trans_Date"))

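As a quick sanity check, the core strptime/strftime logic of the UDF above can be exercised outside Spark entirely (plain Python, no Spark session needed; the sample date "3/10/2017" is taken from the question and parses day-first):

```python
from datetime import datetime

def DateConvert(x):
    # Append a midnight time so the day-first date parses as a full timestamp
    x_augm = str(x) + " 00:00:00"
    datetime_object = datetime.strptime(x_augm, '%d/%m/%Y %H:%M:%S')
    # Re-emit in Hive's "YYYY-MM-DD HH:MM:SS" layout
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S')

print(DateConvert("3/10/2017"))  # → 2017-10-03 00:00:00
```

Note that "3/10/2017" comes out as October 3rd, not March 10th, because the pattern is `%d/%m/%Y` (day-first).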
This properly formats the timestamp. When I run

df.select("Trans_Date").show(10, False)

I get:

```
+-------------------+
|Trans_Date         |
+-------------------+
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
+-------------------+
```

Then I import the data into Hive with Spark SQL like this:

df.createOrReplaceTempView('tempTable')
spark.sql("insert into table db.table select * from tempTable")

My problem is that when I look at the table in Hive, my Timestamp field has values like:

2017-10-16 21:00:00

which is very peculiar!

Thanks in advance for any suggestion.

This is a common problem when saving data into Hive tables with the TIMESTAMP data type.

When you save data into a Hive table, TIMESTAMP values represent the local timezone of the host where the data was written.

Here 2017-10-16 00:00:00 (UTC by default) got converted to 2017-10-16 21:00:00 in the local timezone of the Hive host.
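The kind of shift described above can be reproduced in plain Python. This is only an illustrative sketch that assumes a host zone of UTC-3; the actual offset (and whether the date also rolls over) depends on the real timezone of the Hive host:

```python
from datetime import datetime, timezone, timedelta

# Assumed host zone for illustration only: UTC-3 (not taken from the question)
local = timezone(timedelta(hours=-3))

# The instant Spark wrote, taken as midnight UTC
written_utc = datetime(2017, 10, 16, 0, 0, 0, tzinfo=timezone.utc)

# The same instant rendered in the assumed local zone: the hour shows as 21:00
as_local = written_utc.astimezone(local)
print(as_local.strftime('%Y-%m-%d %H:%M:%S'))  # → 2017-10-15 21:00:00
```

In this sketch the date also moves back a day; the exact value you see depends on which side of the write/read path applies the timezone conversion.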

To avoid undesired results from unexpected timezone issues, Impala stores and interprets timestamps relative to UTC, both when writing to and reading from data files.

You can refer to the documentation below for the necessary configuration settings: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_timestamp.html#timestamp

I was able to solve this by adding fractional-second digits when creating the timestamp in Spark. I simply formatted the time in HH:MM:SS.ff format, and now the time in the Hive table shows as 00:00:00, which is what I wanted.

My new date conversion routine is:

def DateConvert(x):
    x_augm = str(x)+" 00:00:00.0"
    datetime_object = datetime.strptime(x_augm,'%d/%m/%Y %H:%M:%S.%f')
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S.%f')  
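Again as a plain-Python check of the revised routine: `strftime`'s `%f` directive always emits six digits of microseconds, so the appended ".0" comes out as ".000000" (the sample date is the one from the question):

```python
from datetime import datetime

def DateConvert(x):
    # Append a midnight time with a fractional-second digit
    x_augm = str(x) + " 00:00:00.0"
    datetime_object = datetime.strptime(x_augm, '%d/%m/%Y %H:%M:%S.%f')
    # %f pads the fractional part to six digits on output
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S.%f')

print(DateConvert("3/10/2017"))  # → 2017-10-03 00:00:00.000000
```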
