
Does Hive's Create External Table copy data?

I have a Spark application that writes output files in Avro format. Now I would like that data to be available in Hive, because an application which would utilise that data can only do so through a Hive table.

It is described here that one can do that by using CREATE EXTERNAL TABLE in Hive. Now my question is: how efficient is the CREATE EXTERNAL TABLE method? Would it copy all the Avro data somewhere else on HDFS in order to work, or does it just create some metadata that it can use to query the Avro data?

Also, what if I want to keep adding new Avro data to that table? Can I create such an external table once and then keep adding new Avro data to it? And what if someone queries the data while it is being updated? Does Hive allow atomic transactions?

A Hive CREATE TABLE (or CREATE EXTERNAL TABLE) statement does not copy any data. The data remains in the location specified in the table DDL; the statement only creates metadata in the Hive metastore.
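For example, a minimal sketch of such a table definition over the Avro files written by the Spark job (the database name, table name, columns and path below are hypothetical placeholders, not taken from the question):

    -- Creates metadata in the metastore only; no Avro files are read or copied.
    CREATE EXTERNAL TABLE mydb.events_avro (
      event_id   STRING,   -- columns must match the Avro schema of the files
      event_time BIGINT,
      payload    STRING
    )
    STORED AS AVRO
    LOCATION 'hdfs:///data/spark-output/events';  -- directory the Spark job writes to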

You can add files later in the same location.
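For example (continuing with the hypothetical table above), files dropped into the table's directory simply show up in later queries; only a partitioned table would need extra registration:

    -- New Avro files copied into the table's location are visible to subsequent queries:
    SELECT COUNT(*) FROM mydb.events_avro;

    -- If the table were partitioned, new partition directories would additionally
    -- have to be registered, e.g. with:
    --   MSCK REPAIR TABLE mydb.events_avro;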

HDFS does not allow in-place updates to files. You can delete files and put new ones, but a SELECT that runs between the delete and the arrival of the new files will return an empty result set.

If the underlying filesystem is S3 and you are rewriting the same files or deleting them, eventual-consistency issues may occur (file not found, etc.).

Also, when you manipulate files directly, Hive statistics are not refreshed, because Hive does not know that you have changed the data.
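If query planning relies on those statistics, they can be recomputed manually after the files change; a sketch using the hypothetical table from above:

    -- Recompute table-level and column-level statistics after adding or replacing files.
    ANALYZE TABLE mydb.events_avro COMPUTE STATISTICS;
    ANALYZE TABLE mydb.events_avro COMPUTE STATISTICS FOR COLUMNS;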

Hive does not know whether you have changed files, because the filesystem and Hive are only loosely connected. Hive keeps metadata with the table schema definition, SerDe, location, statistics, and so on, and that metadata stays the same after you change the data in the table location.
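You can inspect exactly what Hive keeps in the metastore for a table (again using the hypothetical table name from above):

    -- Shows location, SerDe, input/output formats, table properties and statistics.
    DESCRIBE FORMATTED mydb.events_avro;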

Hive write operations are atomic. If you insert or rewrite data using HiveQL, Hive writes the data to a temporary location and moves the files into the table location only if the command succeeds (in the case of a rewrite, the old files are deleted). If the SQL fails, the data remains as it was before the command.
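For example, both of these HiveQL writes get that stage-then-move behaviour (the staging table below is hypothetical):

    -- Append: new files are written to a temporary location and then moved in.
    INSERT INTO TABLE mydb.events_avro
    SELECT * FROM mydb.events_staging;

    -- Rewrite: existing files are replaced only after the new data has been fully written.
    INSERT OVERWRITE TABLE mydb.events_avro
    SELECT * FROM mydb.events_staging;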

But since Hive does not copy data from the table location into some internal managed storage, if you manipulate files while Hive is reading them, the Hive process will throw an exception. Hive cannot lock the table during your file operations, because it does not know about them. The filesystem is quite detached from Hive, and you can do anything in the filesystem as if Hive did not exist at all.

Read also about Hive ACID mode: Hive Transactions

Also read about the difference between managed and external tables in Hive.
