How Hive partitioning works

Original post: 2016-04-07 08:04:25 | Tags: hadoop, hive, partitioning

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it actually works and how it stores data in the exact partition. Let's say I have a table with a dynamic partition on year, and I have ingested data from 2013 onward. How does Hive create the partitions and store each piece of data in the right one?

2 Answers

Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of partition columns, such as date.

Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy certain criteria.

With partitions, it is easy to query just a portion of the data.

Tables or partitions can be subdivided into buckets, which give the data extra structure that may be used for more efficient querying. Bucketing works based on the value of a hash function applied to some column of the table.
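To make the bucketing idea concrete, here is a minimal Python sketch of "hash the column, then take it modulo the number of buckets". It is a simplified model, not Hive's code: the names `bucket_for` and `assign_buckets` are made up for illustration, and it assumes an integer column whose hash is the value itself. The hash Hive really applies depends on the column type and the Hive version.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Map an integer key to a bucket index in [0, num_buckets)."""
    # Assumption: the hash of an integer key is the integer itself.
    # Masking with 0x7FFFFFFF keeps the value non-negative before the modulo.
    return (key & 0x7FFFFFFF) % num_buckets


def assign_buckets(rows, key_col, num_buckets):
    """Group rows (dicts) into buckets by the hashed value of key_col."""
    buckets = {i: [] for i in range(num_buckets)}
    for row in rows:
        buckets[bucket_for(row[key_col], num_buckets)].append(row)
    return buckets
```

Rows with the same key always land in the same bucket, which is what lets a query on that column look at one bucket instead of all of them.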

Suppose you need to retrieve the details of all employees who joined in 2012. Without partitioning, the query searches the whole table for the required information. However, if you partition the employee data by year so that each year's data is stored separately, the query only has to read the 2012 data, which reduces the query processing time.
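As a rough illustration of how each row ends up in the right partition, the sketch below models a dynamic-partition load in plain Python rather than in Hive itself. The function `write_partitioned` and the file name `data.csv` are hypothetical. What it borrows from Hive is the layout: one directory per partition value, named `column=value`, with the partition column left out of the data file because the directory name already carries it.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_partitioned(rows, table_dir, partition_col):
    """Write rows (dicts) under table_dir, one sub-directory per partition value."""
    # Group rows by the value of the partition column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)

    for value, group in groups.items():
        # A new partition directory is created the first time a value is seen.
        part_dir = os.path.join(table_dir, f"{partition_col}={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "data.csv"), "w", newline="") as f:
            writer = csv.writer(f)
            for row in group:
                # The partition column is not written; the directory name holds it.
                writer.writerow([v for k, v in row.items() if k != partition_col])

    return sorted(os.listdir(table_dir))


if __name__ == "__main__":
    employees = [
        {"id": 1, "name": "Ann", "year": 2012},
        {"id": 2, "name": "Bob", "year": 2013},
        {"id": 3, "name": "Cid", "year": 2013},
    ]
    with tempfile.TemporaryDirectory() as table_dir:
        print(write_partitioned(employees, table_dir, "year"))
```

The point of the sketch is that nothing has to declare the years in advance: the partition value is read from each row, and the matching directory is created on demand.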

If the table is not partitioned, all the data is stored in one directory in no particular order. If the table is partitioned (e.g. by year), the data is stored separately in different directories, each corresponding to one year. For a non-partitioned table, when you want to fetch the data for year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just goes to the year=2010 directory, which is faster and more I/O efficient.
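The difference in work can be sketched with a small in-memory model in Python. This is only an illustration, with hypothetical names (`scan_unpartitioned`, `scan_partitioned`) and a dict standing in for the partition directories; it counts how many rows each approach has to read to answer a year=2010 query.

```python
def scan_unpartitioned(rows, year):
    """Full scan: every row is read and its year column is checked."""
    rows_read = len(rows)
    matches = [r for r in rows if r["year"] == year]
    return matches, rows_read


def scan_partitioned(partitions, year):
    """Pruned scan: only the matching 'year=<value>' partition is read."""
    rows = partitions.get(f"year={year}", [])
    return list(rows), len(rows)


if __name__ == "__main__":
    flat = [
        {"id": 1, "year": 2010},
        {"id": 2, "year": 2011},
        {"id": 3, "year": 2010},
        {"id": 4, "year": 2012},
    ]
    # The same data, laid out as one entry per partition directory.
    partitioned = {
        "year=2010": [{"id": 1}, {"id": 3}],
        "year=2011": [{"id": 2}],
        "year=2012": [{"id": 4}],
    }
    print(scan_unpartitioned(flat, 2010)[1], "rows read without partitions")
    print(scan_partitioned(partitioned, 2010)[1], "rows read with partitions")
```

Both functions find the same 2010 records, but the partitioned version never touches the rows for other years.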