
In Hive, is the partition column not part of the underlying saved data?

I have some log data with the fields

  1. id, tdate, info

I have created a dynamically partitioned table

CREATE TABLE log_partitioned(id STRING, info STRING)
PARTITIONED BY (tdate STRING);

and then I load the data

FROM logs lg
INSERT OVERWRITE TABLE log_partitioned PARTITION(tdate)
SELECT lg.id, lg.info, lg.tdate
DISTRIBUTE BY tdate;

The data loads successfully via dynamic partitioning. But when I look at the stored data with

hdfs dfs -cat /user/hive/warehouse/log_partitioned/tdate=2000-11-05/part-r-00000

only two column values are there:

  • id1, info1

  • id2, info2 ....

If we run the Hive query

select * from log_partitioned limit 10

it shows all three columns. What should I do so that Hive also stores the partition column in the underlying data?

I'm fairly certain Hive does not do this at all by default. You may be able to accomplish it with a custom SerDe and/or Input/OutputFormat, but it could be tricky. The problem is that anyone can put data in those partition folders, and if they put in data containing the wrong value for that column, how would Hive reconcile that?

What is your use case for this? If you are running the dfs -cat command you mentioned, isn't the tdate obvious from the path you're passing in? If you really want it in the output of a shell command, then you could do something like:

hdfs dfs -cat /foo/bar/tdate=2000-11-05/part-r-00000 | sed -e 's/$/  2000-11-05/'
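
The one-liner above hard-codes the date. A minimal sketch of the same idea that reads the partition value from the path instead might look like the following. The function name append_partition_value and the comma default delimiter are assumptions for illustration, not anything Hive provides; a text table created without a ROW FORMAT clause, like the one in the question, separates fields with Ctrl-A (\001), so pass the delimiter that matches your files.

```shell
# Hypothetical helper (not part of Hive or Hadoop): append the partition
# value parsed from a partition file path to every line read from stdin.
#   $1 = path of the form .../<column>=<value>/<file>
#   $2 = field delimiter to put before the appended value (assumed default: ",")
append_partition_value() {
  path=$1
  delim=${2:-,}
  dir=${path%/*}       # drop the file name      -> .../tdate=2000-11-05
  part=${dir##*/}      # keep the last directory -> tdate=2000-11-05
  value=${part#*=}     # keep what follows "="   -> 2000-11-05
  awk -v d="$delim" -v v="$value" '{ print $0 d v }'
}

# Possible usage against HDFS (path taken from the question):
#   p=/user/hive/warehouse/log_partitioned/tdate=2000-11-05/part-r-00000
#   hdfs dfs -cat "$p" | append_partition_value "$p"
```

This only changes what the shell prints; the files in HDFS still contain two columns.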

Another possible workaround would be to store the same data in two columns of the table, one regular column and one partition column. Like this:

CREATE TABLE log_partitioned(id STRING, info STRING, tdate_1 STRING)
PARTITIONED BY (tdate_2 STRING);

FROM logs lg
INSERT OVERWRITE TABLE log_partitioned PARTITION(tdate_2)
SELECT lg.id, lg.info, lg.tdate as tdate_1, lg.tdate as tdate_2
DISTRIBUTE BY tdate_2;
