Cloudera Impala INVALIDATE METADATA

As discussed in the Impala tutorials, Impala uses a Metastore shared with Hive. But it has also been mentioned that if you create or modify tables using Hive, you should execute the INVALIDATE METADATA or REFRESH command to inform Impala about the changes.

This confuses me, so my question is: if the metadata database is shared, why does Impala need INVALIDATE METADATA or REFRESH to be executed?

And if the reason is that Impala caches the metadata, why don't the daemons update their cache themselves when a cache miss occurs, without the metadata having to be refreshed manually?

Any help is appreciated.

OK! Let's start with the question in your comment: what is the benefit of a centralized metastore?

With a central metastore, the user doesn't have to maintain metadata in two different locations, one each for Hive and Impala. The user can have one central repository, and both tools can access it for any metadata information.

Now, the second part: why is there a need to run INVALIDATE METADATA or REFRESH when the metastore is shared?

Impala uses a Massively Parallel Processing (MPP) paradigm to get the work done. Instead of reading from the centralized metastore for each and every query, it keeps the metadata on the executor nodes, so that it can completely bypass cold starts, where a significant amount of time may be spent reading the metadata.

INVALIDATE METADATA/REFRESH propagates the metadata/block information to the executor nodes.
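To make the caching argument concrete, here is a minimal toy model in Python. It is only a sketch: the class and method names (`Metastore`, `CachingEngine`, `load_all`, `refresh`) are hypothetical and are not Impala internals. It shows how a change made behind the engine's back stays invisible until the cache is reloaded.

```python
# Toy model only: all names here are hypothetical, not Impala internals.

class Metastore:
    """Stands in for the shared metastore: table name -> list of data files."""
    def __init__(self):
        self.tables = {}


class CachingEngine:
    """Stands in for an engine that keeps a local copy of the metadata."""
    def __init__(self, metastore):
        self.metastore = metastore
        self.cache = {}
        self.load_all()

    def load_all(self):
        # Loosely analogous to INVALIDATE METADATA: rebuild the whole cache.
        self.cache = {name: list(files)
                      for name, files in self.metastore.tables.items()}

    def refresh(self, table):
        # Loosely analogous to REFRESH: reload one table that is already known.
        self.cache[table] = list(self.metastore.tables[table])

    def files_for(self, table):
        # Queries read only the cache, so they skip the metastore round trip.
        return self.cache[table]


metastore = Metastore()
metastore.tables["sales"] = ["part-0"]
engine = CachingEngine(metastore)

# A change made outside the engine (e.g. through Hive) touches only the metastore.
metastore.tables["sales"].append("part-1")
stale = engine.files_for("sales")    # the cached, outdated file list

engine.refresh("sales")
fresh = engine.files_for("sales")    # the reloaded file list

# A brand-new table is unknown to the cache until everything is reloaded.
metastore.tables["events"] = ["part-0"]
visible_before = "events" in engine.cache
engine.load_all()
visible_after = "events" in engine.cache
```

Note that the lookup of `sales` is a cache hit both times. The engine has no signal that its entry is outdated, which is why waiting for a cache miss would not catch appended data files.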

Why do it manually?

In earlier versions of Impala, the catalogd process was not present. Metadata updates had to be propagated via the aforementioned commands. Starting with Impala 1.2, catalogd was added, and this process relays the metadata changes from Impala SQL statements to all the nodes in a cluster.

Hence, for changes made through Impala SQL statements, there is no longer a need to do it manually!

Hope that helps.

It is shared, but Impala caches the metadata and uses its statistics in its optimizer. If the metadata is changed in Hive, you have to manually tell Impala to refresh its cache, which is kind of inconvenient. But if you create or change tables in Impala, you don't have to do anything on the Hive side.

@masoumeh When you modify a table via Impala SQL statements, there is no need for INVALIDATE METADATA or REFRESH; this job is done by catalogd. But when you insert:

  1. a NEW table through Hive, i.e. sqoop import .... --hive-import ..., then you have to run INVALIDATE METADATA tableName via impala-shell.

  2. new data files into an existing table (appending data), then you have to run REFRESH tableName, because the only thing you need is the metadata for the newly added data.
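The rule of thumb above can be written down as a small lookup. This is only a sketch: the helper `impala_statement` and its `change` labels are hypothetical names made up for illustration. Only the returned strings are actual Impala statements, to be run from impala-shell.

```python
# Hypothetical helper: maps the kind of change to the Impala statement to run.
def impala_statement(change, table):
    if change == "new_table_via_hive":
        # Impala has never seen this table, so its metadata must be loaded.
        return f"INVALIDATE METADATA {table}"
    if change == "new_data_files":
        # The table is already known; only its file list needs reloading.
        return f"REFRESH {table}"
    if change == "changed_via_impala":
        # catalogd relays the change, so nothing has to be run manually.
        return None
    raise ValueError(f"unknown kind of change: {change!r}")


for_new_table = impala_statement("new_table_via_hive", "tableName")
for_appended_data = impala_statement("new_data_files", "tableName")
for_impala_change = impala_statement("changed_via_impala", "tableName")
```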

