
Does Hive's Create External Table copy data?

I have a Spark application that writes output files in Avro format. Now I would like that data to be available in Hive, because an application which would utilise that data can only do so through a Hive table.

It is described here that one can do that by using CREATE EXTERNAL TABLE in Hive. Now my question is: how efficient is the CREATE EXTERNAL TABLE method? Would it copy all the Avro data somewhere else on HDFS in order to work, or does it just create some metadata that it can use to query the Avro data?

Also, what if I want to keep adding new Avro data to that table? Can I create such an external table once and then keep adding new Avro data to it? And what if someone queries the data while it is being updated? Does Hive allow atomic transactions?

A Hive CREATE TABLE (or CREATE EXTERNAL TABLE) statement does not copy any data. The data remains in the location specified in the table DDL; the statement only creates metadata in the Hive metastore.
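For example, a minimal sketch of such a table definition over the Avro files written by the Spark job (the database name, table name, columns and path below are hypothetical placeholders, not taken from the question):

    -- Creates metadata in the metastore only; no Avro files are read or copied.
    CREATE EXTERNAL TABLE mydb.events_avro (
      event_id   STRING,   -- columns must match the Avro schema of the files
      event_time BIGINT,
      payload    STRING
    )
    STORED AS AVRO
    LOCATION 'hdfs:///data/spark-output/events';  -- directory the Spark job writes to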

You can add files later in the same location.
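For example (continuing with the hypothetical table above), files dropped into the table's directory simply show up in later queries; only a partitioned table would need extra registration:

    -- New Avro files copied into the table's location are visible to subsequent queries:
    SELECT COUNT(*) FROM mydb.events_avro;

    -- If the table were partitioned, new partition directories would additionally
    -- have to be registered, e.g. with:
    --   MSCK REPAIR TABLE mydb.events_avro;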

HDFS does not allow in-place updates to files. You can delete files and put new ones, but a SELECT that runs between the delete and the arrival of the new files will return an empty result set.

If the underlying filesystem is S3 and you are rewriting the same files or deleting them, eventual-consistency issues may occur (file not found, etc.).

Also, when you manipulate files directly, Hive statistics are not refreshed, because Hive does not know that you have changed the data.
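If query planning relies on those statistics, they can be recomputed manually after the files change; a sketch using the hypothetical table from above:

    -- Recompute table-level and column-level statistics after adding or replacing files.
    ANALYZE TABLE mydb.events_avro COMPUTE STATISTICS;
    ANALYZE TABLE mydb.events_avro COMPUTE STATISTICS FOR COLUMNS;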

Hive does not know whether you have changed files, because the filesystem and Hive are only loosely connected. Hive keeps metadata with the table schema definition, SerDe, location, statistics, and so on, and that metadata stays the same after you change the data in the table location.
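You can inspect exactly what Hive keeps in the metastore for a table (again using the hypothetical table name from above):

    -- Shows location, SerDe, input/output formats, table properties and statistics.
    DESCRIBE FORMATTED mydb.events_avro;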

Hive write operations are atomic. If you insert or rewrite data using HiveQL, Hive writes the data to a temporary location and moves the files into the table location only if the command succeeds (in the case of a rewrite, the old files are deleted). If the SQL fails, the data remains as it was before the command.
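For example, both of these HiveQL writes get that stage-then-move behaviour (the staging table below is hypothetical):

    -- Append: new files are written to a temporary location and then moved in.
    INSERT INTO TABLE mydb.events_avro
    SELECT * FROM mydb.events_staging;

    -- Rewrite: existing files are replaced only after the new data has been fully written.
    INSERT OVERWRITE TABLE mydb.events_avro
    SELECT * FROM mydb.events_staging;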

But since Hive does not copy data from the table location into some internal managed storage, if you manipulate files while Hive is reading them, the Hive process will throw an exception. Hive cannot lock the table during your file operations, because it does not know about them. The filesystem is quite detached from Hive, and you can do anything in the filesystem as if Hive did not exist at all.

Read also about Hive ACID mode: Hive Transactions

Also read about the difference between managed and external tables in Hive.
