
How to manually add partition details into hive metastore tables?

In my HDFS, I've partitioned data by date and event_id, and have about 1.4 million Parquet files. Today, to analyze the data in Apache Spark, I use spark.read.parquet("/path/to/root/"). This takes about 30 minutes just to list the files, I have to do it on every run, and it's getting annoying.

Now, I want to set up an external table, using MySQL as the Hive metastore. I'm currently facing the known issue where discovering all 1.4 million partitions takes forever; as we all know, that means MSCK REPAIR TABLE my_table is out of the picture. Instead, I generated a long query (about 400 MB) that looks like this:


ALTER TABLE my_table ADD 
  PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
  PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
  PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
  ...
  PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
  PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''

It has been running for 3 hours and has processed fewer than 100,000 partitions so far. I have observed a few things:

  1. Spark adds the partitions one at a time.
  2. Spark seems to check each path for existence.

Both of these add to the long running time. I've searched and haven't been able to find a way to disable either operation.

So, I want to perform SQL operations manually against the MySQL database that backs the Hive metastore, in order to create and manage the tables. I've looked but have been unable to figure out how to manage those tables by hand. Does anyone know how to do that? Specifically, I want the following:

  1. How can I create an external table with partitions by making direct entries into the Hive metastore tables?
  2. How can I manage external table partitions by making direct upsert queries against the Hive metastore tables?

Is there a good resource I could use to learn about the backing tables in the metastore? I feel doing the inserts manually would be much, much faster. Thank you.
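
For context, my rough understanding (which may well be wrong, and which depends on the Hive version) is that each partition maps to a few rows spread across metastore tables such as PARTITIONS, SDS and PARTITION_KEY_VALS. The kind of insert I have in mind would be something like the sketch below, where the table and column names follow a typical MySQL-backed Hive 2.x/3.x schema, and all IDs and the location are made-up placeholders:

    -- Rough sketch only: schema details may differ across Hive versions;
    -- all IDs and the location below are placeholders.

    -- 1. Storage descriptor for the new partition (location and file formats).
    INSERT INTO SDS (SD_ID, CD_ID, LOCATION, INPUT_FORMAT, OUTPUT_FORMAT,
                     IS_COMPRESSED, IS_STOREDASSUBDIRECTORIES, NUM_BUCKETS, SERDE_ID)
    VALUES (90001, 1001, 'hdfs:///path/to/root/date=2020-01-01/event_id=42',
            'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            0, 0, -1, 2001);

    -- 2. The partition row itself, linking the table to the storage descriptor.
    INSERT INTO PARTITIONS (PART_ID, CREATE_TIME, LAST_ACCESS_TIME, PART_NAME, SD_ID, TBL_ID)
    VALUES (80001, UNIX_TIMESTAMP(), 0, 'date=2020-01-01/event_id=42', 90001, 3001);

    -- 3. One row per partition key value, in partition-key order.
    INSERT INTO PARTITION_KEY_VALS (PART_ID, PART_KEY_VAL, INTEGER_IDX)
    VALUES (80001, '2020-01-01', 0), (80001, '42', 1);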

I think the core problem here is that you have too many partitions. Partitioning should generally be done on a low-cardinality column (something with a relatively small number of distinct values, compared to the total number of records). Typically you want to err on the side of having a smaller number of large files, rather than a large number of small files.

In your example, date is probably a good partitioning column, assuming there are many records for each date. If there are a large number of distinct values for event_id, that's not a great candidate for partitioning. Just keep it as an unpartitioned column.
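
Roughly, the DDL could look like the sketch below (assuming a Parquet external table named my_table; payload is just a stand-in for your real data columns):

    -- Sketch: partition only on the low-cardinality date column and keep
    -- event_id as an ordinary column stored inside the Parquet files.
    CREATE EXTERNAL TABLE my_table (
      event_id STRING,
      payload  STRING   -- stand-in for the real data columns
    )
    PARTITIONED BY (`date` STRING)
    STORED AS PARQUET
    LOCATION '/path/to/root/';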

An alternative to partitioning for a high-cardinality column is bucketing. This groups similar values for the bucketed column so they're in the same file, but doesn't split each value across separate files. The AWS Athena docs have a good overview of the concept.
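
As an illustration, bucketing would amount to adding a CLUSTERED BY clause to the previous sketch (the bucket count of 64 is only a placeholder to tune for your data volume):

    -- Sketch: partition by date, bucket by the high-cardinality event_id.
    CREATE EXTERNAL TABLE my_table_bucketed (
      event_id STRING,
      payload  STRING
    )
    PARTITIONED BY (`date` STRING)
    CLUSTERED BY (event_id) INTO 64 BUCKETS
    STORED AS PARQUET
    LOCATION '/path/to/root_bucketed/';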

The slowness can be an issue with statistics auto-gathering. As a workaround, switch off hive.stats.autogather before recovering partitions.

  1. Switch off statistics auto-gathering:

    set hive.stats.autogather=false;

  2. Run MSCK REPAIR or ALTER TABLE RECOVER PARTITIONS.

If you need statistics to be fresh, you can execute ANALYZE separately for new partitions only.
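
For example, the whole workaround could look roughly like this (the partition values in the ANALYZE statement are placeholders):

    -- Recover partitions with statistics auto-gathering disabled.
    SET hive.stats.autogather=false;
    MSCK REPAIR TABLE my_table;

    -- Then gather statistics only for the partitions that were just added.
    ANALYZE TABLE my_table PARTITION (`date`='2020-01-01', event_id='42')
      COMPUTE STATISTICS;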

Related tickets: HIVE-18743, HIVE-19489, HIVE-17478.
