In Hive, are partition columns not part of the underlying saved data?

Question

I have some log data with the fields

id, tdate, info

I have created a dynamically partitioned table:

CREATE TABLE log_partitioned(id STRING, info STRING)
PARTITIONED BY (tdate STRING);

and then I load the data:

FROM logs lg
INSERT OVERWRITE TABLE log_partitioned PARTITION(tdate)
SELECT lg.id, lg.info, lg.tdate
DISTRIBUTE BY tdate;

The data loads successfully through dynamic partitioning. But when I look at the data with

hdfs dfs -cat /user/hive/warehouse/log_partitioned/tdate=2000-11-05/part-r-00000

only two column values are there:

id1, info1
id2, info2 ....

If we run the Hive query

select * from log_partitioned limit 10

it shows all three columns. What should I do so that Hive also stores the partition column in the underlying data?

Answer 1

I'm fairly certain Hive does not do this at all by default. You may be able to accomplish it with a custom SerDe and/or Input/OutputFormat, but it could be tricky. The problem is that anyone can put data in those partition folders, and if they put in data containing the wrong value for that column, how would Hive reconcile that?

What is your use case for this? If you are running the dfs -cat command you mentioned, isn't the tdate obvious from the path you're passing in? If you really want it in the output of a shell command, then something like:

hdfs dfs -cat /foo/bar/tdate=2000-11-05/part-r-00000 | sed -e 's/$/  2000-11-05/'
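If you don't want to retype the date for every partition, the same idea can be generalized so that the value is taken from the path itself. This is only a rough sketch: the function name append_partition_value is made up for illustration, it assumes the partition column is called tdate and that the path contains a "tdate=VALUE" component, and it appends the value after a tab, which is an arbitrary choice of separator.

```shell
# Hypothetical helper (not part of Hive or Hadoop): append the partition
# value encoded in a path of the form ".../tdate=VALUE/..." to every line
# read from stdin.
append_partition_value() {
  path="$1"
  # Keep only the text after "tdate=" and before the next "/".
  value="${path##*tdate=}"
  value="${value%%/*}"
  # Hand the value to awk as a variable so it is never parsed as sed syntax.
  awk -v v="$value" '{ print $0 "\t" v }'
}

# Usage sketch (assumes the hdfs CLI is installed and the path exists):
# p=/user/hive/warehouse/log_partitioned/tdate=2000-11-05/part-r-00000
# hdfs dfs -cat "$p" | append_partition_value "$p"
```

If the path has no "tdate=" component, the extracted value will not be meaningful, so you may want to validate it before relying on the output.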

Another possible workaround would be to store the same data in two columns in the table, like this:

CREATE TABLE log_partitioned(id STRING, info STRING, tdate_1 STRING)
PARTITIONED BY (tdate_2 STRING);

FROM logs lg
INSERT OVERWRITE TABLE log_partitioned PARTITION(tdate_2)
SELECT lg.id, lg.info, lg.tdate as tdate_1, lg.tdate as tdate_2
DISTRIBUTE BY tdate_2;
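With this workaround the date lives in two places, so the reconciliation problem mentioned earlier becomes something you can at least check for. Below is a minimal sketch of such a check. The function name find_partition_mismatches is hypothetical, and it assumes the table is stored as plain text with Hive's default field delimiter (Ctrl-A, \001), with tdate_1 as the third field in each row.

```shell
# Hypothetical check (not part of Hive): print rows whose stored date
# (3rd field, tdate_1) differs from the partition value in the path
# (".../tdate_2=VALUE/..."). Exits with status 1 if any mismatch is found.
find_partition_mismatches() {
  path="$1"
  # Keep only the text after "tdate_2=" and before the next "/".
  value="${path##*tdate_2=}"
  value="${value%%/*}"
  # Assumption: fields are separated by Ctrl-A, Hive's default for text tables.
  awk -F "$(printf '\001')" -v v="$value" '
    ($3 "") != (v "") { print; bad = 1 }   # force a string comparison
    END { exit bad + 0 }
  '
}

# Usage sketch (assumes the hdfs CLI is installed and the path exists):
# p=/user/hive/warehouse/log_partitioned/tdate_2=2000-11-05/part-r-00000
# hdfs dfs -cat "$p" | find_partition_mismatches "$p"
```

No output and an exit status of 0 would mean every row in that file agrees with its partition directory.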

Answer 1
2 2013-10-02 17:55:17