Spark SQL (Hive query through HiveContext): INSERT OVERWRITE does not overwrite existing data when the Hive table has multiple partitions

Question

Hive 1.2.1000.2.6.1.0-129: We are trying to INSERT OVERWRITE the test5 table, which has multiple partitions. According to the documentation ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML ), INSERT OVERWRITE overwrites any existing data in the table or partition. However, we still see some old data after the INSERT OVERWRITE query runs. A sample execution and its output are below.

Spark 2.1.1: We get the same output when running the query through HiveContext.

CREATE TABLE dbtest.test5 (emp_id INT) PARTITIONED BY (depart_id INT,depart_name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'externalpath'; 

INSERT INTO TABLE dbtest.test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from dbtest.tempTableHive1; 

4       123     Dev 
5       123     Dev 
6       123     Test 
7       567     Test 

INSERT INTO TABLE dbtest.test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from dbtest.tempTableHive2; 
4       123     Dev 
5       123     Dev 
1       123     Dev 
2       123     Dev 
6       123     Test 
3       123     Test 
7       567     Test 

INSERT OVERWRITE TABLE dbtest.test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from dbtest.tempTableHive3; 

8       123     Dev 
9       123     Dev 
10      123     Dev 
6       123     Test 
3       123     Test 
7       567     Test

Is there something wrong with the code, or is this an Apache Hive problem?

Answer 1

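Hive does overwrite the partition when you specify INSERT OVERWRITE. With dynamic partitions, however, only the partitions that the SELECT produces rows for are replaced; partitions that receive no new rows keep their existing data. See my output below, from the Cloudera QuickStart VM.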

hive> SELECT * FROM tempTableHive1;
OK
4   123 Dev
5   567 Test
Time taken: 0.048 seconds, Fetched: 2 row(s)
hive> INSERT INTO TABLE test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from tempTableHive1; 

hive> SELECT * FROM test5;
OK
4   123 Dev
5   567 Test
Time taken: 0.065 seconds, Fetched: 2 row(s)

hive> SELECT * FROM tempTableHive2;
OK
4   123 Dev
6   123 Dev
Time taken: 0.047 seconds, Fetched: 2 row(s)

hive> INSERT INTO TABLE test5  PARTITION (depart_id,depart_name) 
    > SELECT emp_id,depart_id,depart_name from tempTableHive2; 

hive> SELECT * FROM test5;
OK
4   123 Dev
4   123 Dev
6   123 Dev
5   567 Test
Time taken: 0.057 seconds, Fetched: 4 row(s)

hive> SELECT * FROM tempTableHive3;
OK
100 123 Dev
101 123 Dev

hive> INSERT OVERWRITE TABLE test5  PARTITION (depart_id,depart_name) 
    > SELECT emp_id,depart_id,depart_name from tempTableHive3;

hive> SELECT * FROM test5;
OK
100 123 Dev
101 123 Dev
5   567 Test
Time taken: 0.072 seconds, Fetched: 3 row(s)
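
If the intent is to replace the table's entire contents rather than only the partitions present in the new data, one option is to drop the stale partitions before the insert. The following is a minimal sketch, not a definitive fix: it assumes the table and column names from the question, and the partition values in the DROP statements are taken from the question's output, so adjust them to whatever partitions actually need to go.

```sql
-- Dynamic partitioning with no static partition column requires nonstrict mode.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Drop the partitions the new data will not touch
-- (values assumed from the question's output).
ALTER TABLE dbtest.test5 DROP IF EXISTS PARTITION (depart_id=123, depart_name='Test');
ALTER TABLE dbtest.test5 DROP IF EXISTS PARTITION (depart_id=567, depart_name='Test');

-- Replace the remaining partitions with the new rows.
INSERT OVERWRITE TABLE dbtest.test5 PARTITION (depart_id, depart_name)
SELECT emp_id, depart_id, depart_name FROM dbtest.tempTableHive3;
```

For a managed table, DROP PARTITION also deletes the partition's files. For an EXTERNAL table it removes only the metadata, so the files would have to be deleted separately.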

If you are still having trouble, the best way to debug this is to examine the HDFS files. There should be one directory per depart_id/depart_name combination, for example /user/hive/warehouse/test5/depart_id=123/depart_name=Dev. Because the data is stored as text files, you can quickly "cat" them to see the contents. Let us know how you get on.
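
The inspection described above can be sketched as follows. It assumes the Hive CLI (which accepts dfs commands directly) and the example warehouse path mentioned above; the real location may differ, especially for a table created with an explicit LOCATION, so check the Location field reported by DESCRIBE FORMATTED first.

```sql
-- List the partitions Hive knows about for the table.
SHOW PARTITIONS test5;

-- Show the metadata for one partition, including its Location.
DESCRIBE FORMATTED test5 PARTITION (depart_id=123, depart_name='Dev');

-- From the Hive CLI, list and print that partition's files
-- (path is the example from above, not necessarily the actual one).
dfs -ls /user/hive/warehouse/test5/depart_id=123/depart_name=Dev;
dfs -cat /user/hive/warehouse/test5/depart_id=123/depart_name=Dev/*;
```

Outside the Hive CLI, the same listing can be done from a shell with hdfs dfs -ls and hdfs dfs -cat.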

solution1
0 2017-11-22 21:17:44
