简体   繁体   中英

how to merge data while loading them into hive?

I'm tring to use hive to analysis our log, and I have a question.

Assume we have some data like this:
A 1
A 1
A 1
B 1
C 1
B 1
How can I make it like this in hive table(order is not important, I just want to merge them) ?
A 1
B 1
C 1

without pre-process it with awk/sed or something like that?
Thanks!

Step 1: Create a Hive table for input data set . create table if not exists table1 (fld1 string, fld2 string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'; (i assumed field seprator is \\t, you can replace it with actual separator)

Step 2 : Run below to get the merge data you are looking for

create table table2 as select fld1,fld2 from table1 group by fld1,fld2 ;

I tried this for below input set

hive (default)> select * from table1; OK

A 1

A 1

A 1

B 1

C 1

B 1

create table table4 as select fld1,fld2 from table1 group by fld1,fld2 ;

hive (default)> select * from table4; OK

A 1

B 1

C 1

You can use external table as well , but for simplicity I have used managed table here.

One idea.. you could create a table around the first file (called 'oldtable').

Then run something like this.... create table newtable select field1, max(field) from oldtable group by field1;

Not sure I have the syntax right, but the idea is to get unique values of the first field, and only one of the second. Make sense?

For merging the data, we can also use "UNION ALL" , it can also merge two different types of datatypes.

insert overwrite into table test1 (select x.* from t1 x ) UNION ALL (select y.* from t2 y);

here we are merging two tables data (t1 and t2) into one single table test1.

There's no way to pre-process the data while it's being loaded without using an external program. You could use a view if you'd like to keep the original data intact.

hive> SELECT * FROM table1;
OK
A       1
A       1
A       1
B       1
C       1
B       1
B       2 # Added to show it will group correctly with different values

hive> CREATE VIEW table2 (fld1, fld2) AS SELECT fld1, fld2 FROM table1 GROUP BY fld1, fld2;
hive> SELECT * FROM table2;
OK
A       1
B       1
B       2
C       1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM