
How does data distribution happen in bucketing in Hive?

I have created a table as below with 3 buckets, and loaded some data into it.

create table testBucket (id int,name String)        
    partitioned by (region String)
    clustered by (id) into 3 buckets;    

I have set the bucketing property as well: set hive.enforce.bucketing=true;

But when I listed the table files in HDFS, I could see that 3 files were created, matching the 3 buckets I specified. However, the data got loaded into only one file, and the other 2 files are empty. So I am confused: why did my data get loaded into only one file?

So could someone please explain how data distribution happens in bucketing?

[test@localhost user]$ hadoop fs -ls /user/hive/warehouse/database2.db/buckettab/region=USA
Found 3 items
-rw-r--r--   1 user supergroup         38 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000000_0
-rw-r--r--   1 user supergroup          0 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000001_0
-rw-r--r--   1 user supergroup          0 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000002_0

Bucketing is a method to evenly distribute the data across many files. Create multiple buckets and then place each record into one of them based on some logic, usually a hashing algorithm.

Hive's bucketing feature can be used to distribute/organize the table/partition data into multiple files such that similar records are present in the same file. While creating a Hive table, the user specifies the columns to be used for bucketing and the number of buckets to store the data in. Which records go to which bucket is decided by the hash value of the bucketing columns:

[Hash(column(s))] MOD [Number of buckets]

The hash value is calculated differently for different column types. For int columns, the hash value is equal to the int value itself. For String columns, the hash value is computed from each character present in the String.
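As a quick illustration, Hive's built-in hash() UDF can be used to preview this formula. This is a hedged sketch: it assumes hash() matches the hash used for bucketing on these primitive types (which holds for int, where the hash is the value itself), and pmod() keeps the result non-negative:

-- Sketch: preview which of the 3 buckets each row of testBucket would land in
SELECT id, name, pmod(hash(id), 3) AS bucket
FROM testBucket;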

Data for each bucket is stored in a separate HDFS file under the table directory on HDFS. Inside each bucket, we can define the arrangement of data by providing a SORTED BY column while creating the table.
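For example, a minimal sketch of a bucketed-and-sorted table (the name employee_sorted and its columns are hypothetical):

-- Hypothetical example: rows within each bucket file are kept sorted by salary
CREATE TABLE employee_sorted (
  id BIGINT,
  name STRING,
  salary BIGINT
)
CLUSTERED BY (id) SORTED BY (salary DESC) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;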

Let's see an example.

Creating a Hive table using bucketing

To create a bucketed table, we need to use the CLUSTERED BY clause to define the bucketing columns and provide the number of buckets. The following query creates a table Employee bucketed by the ID column into 5 buckets.

CREATE TABLE Employee(
ID BIGINT,
NAME STRING, 
AGE INT,
SALARY BIGINT,
DEPARTMENT STRING 
)
COMMENT 'This is Employee table stored as textfile clustered by id into 5 buckets'
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
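As an optional check (not part of the original walkthrough), DESCRIBE FORMATTED reports the bucketing metadata for the table:

-- The output includes 'Num Buckets: 5' and 'Bucket Columns: [id]'
DESCRIBE FORMATTED Employee;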

Inserting data into a bucketed table

We have the following data in the Employee_old table.

0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id  | employee_old.name  | employee_old.age  | employee_old.salary  | employee_old.department  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1                | Sudip              | 34                | 62000                | HR                       |
| 2                | Suresh             | 45                | 76000                | FINANCE                  |
| 3                | Aarti              | 25                | 37000                | BIGDATA                  |
| 4                | Neha               | 27                | 39000                | FINANCE                  |
| 5                | Rajesh             | 29                | 59000                | BIGDATA                  |
| 6                | Suman              | 37                | 63000                | HR                       |
| 7                | Paresh             | 42                | 71000                | BIGDATA                  |
| 8                | Rami               | 33                | 56000                | HR                       |
| 9                | Arpit              | 41                | 46000                | HR                       |
| 10               | Sanjeev            | 51                | 99000                | FINANCE                  |
| 11               | Sanjay             | 32                | 67000                | FINANCE                  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+

We will select data from the table Employee_old and insert it into our bucketed table Employee.

We need to set the property hive.enforce.bucketing to true while inserting data into a bucketed table. This enforces bucketing when inserting data into the table. (In Hive 2.x and later this property was removed, and bucketing is always enforced.)

Set the property

0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing=true;

Insert data into the bucketed table Employee

0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;

Verify the Data in Buckets

Once we execute the INSERT query, we can verify that 5 files were created under the Employee table directory on HDFS.

Name        Type
000000_0    file
000001_0    file
000002_0    file
000003_0    file
000004_0    file

Each file represents a bucket. Let us see the contents of these files.

Content of 000000_0

All records with Hash(ID) mod 5 == 0 go into this file (5 mod 5 = 0, 10 mod 5 = 0).

5,Rajesh,29,59000,BIGDATA
10,Sanjeev,51,99000,FINANCE

Content of 000001_0

All records with Hash(ID) mod 5 == 1 go into this file.

1,Sudip,34,62000,HR
6,Suman,37,63000,HR
11,Sanjay,32,67000,FINANCE

Content of 000002_0

All records with Hash(ID) mod 5 == 2 go into this file.

2,Suresh,45,76000,FINANCE
7,Paresh,42,71000,BIGDATA

Content of 000003_0

All records with Hash(ID) mod 5 == 3 go into this file.

3,Aarti,25,37000,BIGDATA
8,Rami,33,56000,HR

Content of 000004_0

All records with Hash(ID) mod 5 == 4 go into this file.

4,Neha,27,39000,FINANCE
9,Arpit,41,46000,HR
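To double-check the distribution, a hedged one-liner can recompute the expected bucket for every row of the source table (relying on the fact that the hash of an int is the value itself):

-- Recompute the expected bucket number for each row
SELECT id, name, pmod(id, 5) AS expected_bucket
FROM employee_old
ORDER BY id;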

I suspect all of the ID MOD 3 values come out the same for the USA partition (region=USA) in your sample data.

The bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '& 0x7FFFFFFF' in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int and there were 10 buckets, we would expect all user_ids that end in 0 to be in bucket 1, all user_ids that end in 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_ids in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.

Take a look at the language manual here.

It states:

How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '& 0x7FFFFFFF' in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int and there were 10 buckets, we would expect all user_ids that end in 0 to be in bucket 1, all user_ids that end in 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_ids in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.
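In HiveQL terms, the complete expression from the manual can be sketched as follows (2147483647 is 0x7FFFFFFF; hash() is Hive's built-in UDF, assumed here to match the bucketing hash for primitive types):

-- Full bucket formula: mask to a non-negative value, then mod by the bucket count
SELECT id, (hash(id) & 2147483647) % 3 AS bucket
FROM testBucket;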

In your case, because you are clustering by Id, which is an Int, and then bucketing into only 3 buckets, it looks like all values are being hashed into one of those buckets. To make sure this is working, add some rows with ids different from the ones already in the file, increase the number of buckets, and see whether they get hashed into separate files this time around.
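A hedged way to run that test (the table name testBucket6 and the bucket count are hypothetical):

-- Hypothetical re-test: more buckets, then re-insert and re-list the files
CREATE TABLE testBucket6 (id INT, name STRING)
PARTITIONED BY (region STRING)
CLUSTERED BY (id) INTO 6 BUCKETS;

SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE testBucket6 PARTITION (region='USA')
SELECT id, name FROM testBucket WHERE region = 'USA';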
