简体   繁体   中英

Slow loading of partitioned Hive table

I'm loading a table in Hive thats partitioned by date. It currently contains about 3 years worth of records, so circa 900 partitions (ie 365*3).

I'm loading daily deltas into this table, adding an additional partition per day. I achieve this using dynamic partitioning as I cant guarantee my source data only contains one days worth of data (eg if I'm recovering from a failure I may have multiple days of data to process).

This is all fine and dandy, however I've noticed the final step of actually writing the partition has become very slow. By this I mean the logs show the MapReduce stage completes quickly, its just very slow on the final step as it seems to scan and open all existing partitions regardless of if they will be overwritten.

Should I be explicitly creating partitions to avoid this step?

Whether the partitions are dynamic or static typically should not alter the performance drastically. Can you check in each of the partitions how many actual files are getting created? Just want to make sure that the actual writing is not serialized which it could be if it's writing to only one file. Also check how many mappers & reducers were employed by the job.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM