

Hive, how to partition by a column with null values, putting all nulls in one partition

I am using Hive, and the IDE is Hue. I am trying different combinations of columns to choose as my partition key(s).

The definition of my original table is as follows:

CREATE EXTERNAL TABLE `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint,
    `result_section_text` string,
    `result_section_unit` string,
    `result_section_col` string,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_day` string
    )
    PARTITIONED BY (
      `date_year` string,
      `date_month` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3';

The above code works properly. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. However, I am getting the following error:

Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask

When the partition keys were date_year and date_month, MSCK worked properly.

The definition of the table I am getting the error for is as follows:

CREATE EXTERNAL TABLE `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint,
    `result_section_text` string,
    `result_section_unit` string,
    `result_section_col` string,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string,
    `date_month` string
  )
    PARTITIONED BY (
     `date_day` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3';

After this, the following query returns nothing:

Select * From `my_hive_db`.`my_table` Limit 10;

I therefore ran the following command:

MSCK REPAIR TABLE `my_hive_db`.`my_table`;

And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask

I checked this link, as it is exactly the error I am getting, but when using the solution provided there:

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;

I get a different error:

Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
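
(Aside: on managed clusters, only whitelisted parameters can be changed per session; the whitelist is controlled by hive.security.authorization.sqlstd.confwhitelist, which an administrator would have to extend. If that is not an option, one workaround is to skip MSCK entirely and register partitions explicitly. A minimal sketch, using the table's LOCATION from above and a hypothetical partition value of 2021-01-01:

-- Register one partition by hand instead of relying on MSCK.
-- '2021-01-01' is a hypothetical value; repeat or script this per folder.
ALTER TABLE `my_hive_db`.`my_table` ADD IF NOT EXISTS
  PARTITION (date_day='2021-01-01')
  LOCATION 's3a://some/where/in/amazon/s3/date_day=2021-01-01';
)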

I think the reason I am getting these errors is that there are more than 200 million records in which date_day has a null value.

There are 31 distinct non-null values of date_day. I would like to partition my table into 32 partitions, one for each distinct value of the date_day field, with all the null values going into a separate partition. Is there a way to do so (partitioning by a column with null values)?

If this can be achieved with Spark, I am also open to using it.

This is part of a bigger problem of changing partition keys by recreating a table, as mentioned in the link in the answer to my other question.

Thank you for your help.

You seem to misunderstand how Hive's partitioning works. Hive stores data as files on HDFS (or S3, or some other distributed storage). If you create a non-partitioned Parquet table called my_schema.my_table, you will see the files in your distributed storage stored in a folder:

hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...

If you create a table partitioned by a column p_col, the files will look like:

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...

The MSCK REPAIR TABLE command allows you to automatically reload the partitions when you create an external table.

Let's say you have folders on S3 that look like this:

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet

You create an external table with

CREATE EXTERNAL TABLE my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_col STRING)

The table will be created but empty, because Hive hasn't detected the partitions yet. You run MSCK REPAIR TABLE my_schema.my_table, and Hive will recognize that your partition column p_col matches the partitioning scheme on S3 (/p_col=value1/).
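
For example, after the repair you can confirm that the partitions were registered (SHOW PARTITIONS is standard HiveQL):

MSCK REPAIR TABLE my_schema.my_table;
-- Should now list p_col=value1, p_col=value2, p_col=value3
SHOW PARTITIONS my_schema.my_table;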

From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing

CREATE EXTERNAL TABLE my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_another_col STRING)

and you are getting an error message because p_another_col doesn't match the column used on S3, which was p_col. This error is perfectly normal, since what you are doing doesn't make sense.

As stated in the other question's answer, you need to create a copy of the first table with a different partitioning scheme.
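
A side note on cleanup: if you want to remove the table definition that produced the error, dropping it is safe for the underlying data, because DROP on an EXTERNAL table only deletes the Hive metadata; the Parquet files on S3 are left untouched (you can recreate the definition later and MSCK will find the files again):

DROP TABLE IF EXISTS `my_hive_db`.`my_table`;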

You should instead try something like this:

CREATE EXTERNAL TABLE my_hive_db.my_table_2(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint,
    `result_section_text` string,
    `result_section_unit` string,
    `result_section_col` string,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string,
    `date_month` string
)
PARTITIONED BY (`date_day` string)

and then populate your new table with dynamic partitioning:
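
Note that dynamic partitioning usually has to be enabled for the session first (these are standard Hive settings, though, as above, they may be locked down on managed clusters):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;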

INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT 
  col_id,
  result_section__col2,
  result_section_col3,
  result_section_col4,
  result_section_col5,
  result_section_col6__label,
  result_section_col7__label_id,
  result_section_text,
  result_section_unit,
  result_section_col,
  result_section_title,
  result_section_title_id,
  col13,
  `timestamp`,
  date_year,
  date_month,
  date_day
FROM my_hive_db.my_table;
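
With dynamic partitioning, rows whose date_day is NULL go to Hive's default partition (named by hive.exec.default.partition.name, __HIVE_DEFAULT_PARTITION__ unless overridden), which gives exactly the 32-partition layout you described: one folder per distinct day plus one for the nulls. Assuming default settings and hypothetical dates, the storage would look like:

.../my_table_2/date_day=2021-06-01/part_00001.parquet
.../my_table_2/date_day=2021-06-02/part_00001.parquet
...
.../my_table_2/date_day=__HIVE_DEFAULT_PARTITION__/part_00001.parquet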
