Spark SQL to Hive table - Datetime Field Hours Bug

I am facing this problem: when I insert data into a Hive timestamp field with spark.sql, the hours are strangely changed to 21:00:00!

Let me explain:

I have a CSV file that I read with spark.sql. I read the file, convert it to a dataframe, and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The Hive field I want to write it to has the Timestamp type. (I use this data type instead of Date because I want to query the table with Impala, and Impala has only the Timestamp data type, so simply changing the data type to Date is not a solution.)

According to the documentation, the Hive Timestamp data type uses the "YYYY-MM-DD HH:MM:SS" format, so before I write the dataframe to the Hive table I convert the date values to that format.

Here is my code in Python:

from datetime import datetime
from pyspark.sql.functions import udf

df = spark.read.csv("hdfs:/user/../MyFile.csv", header=True)

# Use a user-defined function to convert the date format
def DateConvert(x):
    # Append a midnight time so the value parses as a full datetime
    x_augm = str(x) + " 00:00:00"
    datetime_object = datetime.strptime(x_augm, '%d/%m/%Y %H:%M:%S')
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S')

# udf() without an explicit return type yields a string column
DateConvert_udf = udf(DateConvert)

df = df.withColumn("Trans_Date", DateConvert_udf("Trans_Date"))

This properly formats the timestamp. When I run

df.select("Trans_Date").show(10, False)

I get:

+-------------------+
|Trans_Date         |
+-------------------+
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
+-------------------+

Then I insert the data into Hive with Spark SQL like this:

df.createOrReplaceTempView('tempTable')
spark.sql("insert into table db.table select * from tempTable")

My problem is that when I go to Hive my Timestamp field has values like:

2017-10-16 21:00:00

which is very peculiar!

Thanks in advance for any suggestions.

This is a common problem when saving data into Hive tables with the TIMESTAMP data type.

When you save data into a Hive table, TIMESTAMP values represent the local time zone of the host where the data was written.

Here, 2017-10-16 00:00:00 (UTC by default) was converted to 2017-10-16 21:00:00 in the local time zone of the Hive host.

To avoid undesired results from unexpected time zone issues, Impala stores and interprets timestamps relative to UTC, both when writing to and when reading from data files.

Refer to the documentation below for the necessary configuration settings: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_timestamp.html#timestamp
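
The kind of shift described in this answer can be reproduced outside Spark with plain Python. This is only a sketch: the three-hour offset (UTC-3) is an assumption chosen for illustration, since the question does not state the actual time zone of the Hive host, and the helper name `render_in_zone` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the written value is treated as UTC and the reader renders it
# in a zone three hours behind UTC. The real offset on the asker's cluster
# is not given in the question.
ASSUMED_HOST_ZONE = timezone(timedelta(hours=-3))

def render_in_zone(naive_utc_text, zone):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as UTC and format it in `zone`."""
    instant = datetime.strptime(naive_utc_text, "%Y-%m-%d %H:%M:%S")
    instant = instant.replace(tzinfo=timezone.utc)
    return instant.astimezone(zone).strftime("%Y-%m-%d %H:%M:%S")

shifted = render_in_zone("2017-10-16 00:00:00", ASSUMED_HOST_ZONE)
print(shifted)
```

With this assumed offset, midnight is displayed as 21:00. The same instant shows a different hour as soon as the writer and the reader disagree about the time zone, which is why the time zone handling on each side has to match.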

I was able to solve this by adding fractional-second digits when creating the timestamp in Spark. I simply formatted the time as HH:MM:SS.ff, and now the time in the Hive table shows as 00:00:00, which is what I wanted.

My new Date Conversion routine is:

def DateConvert(x):
    # Append midnight with a fractional-second part
    x_augm = str(x) + " 00:00:00.0"
    datetime_object = datetime.strptime(x_augm, '%d/%m/%Y %H:%M:%S.%f')
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S.%f')
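
A quick check of the revised routine outside Spark shows what string it actually produces. The sample input `16/10/2017` is an assumption that matches the day/month/year pattern used in the code. Note that `%f` in `strftime` always emits six digits, so the output carries microseconds rather than the two digits suggested by "HH:MM:SS.ff".

```python
from datetime import datetime

def DateConvert(x):
    # Same logic as the routine above, repeated so this snippet runs standalone.
    x_augm = str(x) + " 00:00:00.0"
    datetime_object = datetime.strptime(x_augm, '%d/%m/%Y %H:%M:%S.%f')
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S.%f')

converted = DateConvert("16/10/2017")  # hypothetical sample value
print(converted)
```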
