
How to combine multiple ORC files (belonging to each partition) in a Partitioned Hive ORC table into a single big ORC file

I have a partitioned ORC table in Hive. After loading the table with all possible partitions, I end up with multiple ORC files on HDFS, i.e. each partition directory has an ORC file in it. For my use case, I need to combine the ORC files under all of these partitions into a single big ORC file.

Can someone suggest a way to combine these multiple ORC files (belonging to each partition) into a single big ORC file?

I've tried creating a new non-partitioned ORC table from the partitioned table. It does reduce the number of files, but not to a single file.

PS: Creating a table from another one is entirely a map task, so setting the number of reducers to 1 with the property 'set mapred.reduce.tasks=1;' doesn't help.

Thanks

You can use the CONCATENATE command to combine the small ORC files. This can be done at the table level as well as the partition level. The syntax, as per the ORC documentation:

users can request an efficient merge of small ORC files together by issuing a CONCATENATE command on their table or partition. The files will be merged at the stripe level without reserialization.

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
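As a rough illustration of the syntax above, here is how the command might be issued for one partition and for a whole table. The table names (sales_orc, sales_orc_flat), the partition column (dt), and the partition value are hypothetical and only stand in for your own schema.

```sql
-- Hypothetical partitioned ORC table `sales_orc`, partitioned by `dt`.
-- List the partitions first so each one can be merged in turn.
SHOW PARTITIONS sales_orc;

-- Merge the small ORC files inside a single partition.
ALTER TABLE sales_orc PARTITION (dt='2015-01-01') CONCATENATE;

-- For a non-partitioned ORC table (hypothetical `sales_orc_flat`),
-- omit the PARTITION clause.
ALTER TABLE sales_orc_flat CONCATENATE;
```

Note that, for a partitioned table, this merges files within each partition directory you name. It does not move data across partitions, so it would have to be run once per partition, and the result may still be more than one file depending on your merge settings.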
