
Spark Partitioning Hive Table

I am trying to partition a Hive table on distinct timestamps. The table has a timestamp column in it, but when I execute the partition query, it says that it is not a valid partition column. Here's the table:

+---+-----------------------+
|id |rc_timestamp           |
+---+-----------------------+
|1  |2017-06-12 17:18:39.824|
|2  |2018-06-12 17:18:39.824|
|3  |2019-06-12 17:18:39.824|
+---+-----------------------+
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

val tempTable = spark.sql("SELECT * FROM partition_table")

val df = tempTable.select("rc_timestamp")

df.collect().foreach(row => {
  val a = row.toString().replaceAll("[\\[\\]]", "")
  spark.sql(s"ALTER TABLE mydb.partition_table ADD IF NOT EXISTS PARTITION (rc_timestamp = '$a')")
})

Here's the error I'm getting:

org.apache.spark.sql.AnalysisException: rc_timestamp is not a valid partition column 
in table mydb.partition_table.;

First, check your syntax against the InsertSuite test cases in the Spark source.

AFAIK you need an MSCK REPAIR or a table refresh:

spark.sql("REFRESH TABLE tableNameWhereYouAddedPartitions")

What this does is refresh the metadata for the existing partitions.

You can also go with spark.sql("MSCK REPAIR TABLE table_name").

There is also something called recoverPartitions (it only works with a partitioned table, not a view). This is an aliased version of MSCK REPAIR TABLE, so you can go ahead and try it.

See ddl.scala; by the documentation it seems to be equivalent. Example usage:

spark.catalog.recoverPartitions(tableName) 

Note: The RECOVER PARTITIONS clause automatically recognizes any data files present in these new directories, the same as the REFRESH statement does.

You cannot change the partitioning scheme of an existing Hive table. Doing so would require rewriting the complete dataset, since partitions are mapped to folders in HDFS/S3/the file system.

If you want to change the partitioning scheme, the only option is to create a new table and specify the partitioning in the CREATE TABLE command. After that you have to insert the data from the old table into the new one. You can also use a CTAS (CREATE TABLE AS SELECT) command for the same purpose.
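A minimal sketch of that approach, reusing the table and column names from the question. The derived `rc_date` partition column is an assumption added here for illustration: partitioning on the raw millisecond timestamp would create one partition per row, so a coarser derived value is usually what you want.

```scala
// Create a new table partitioned by a derived date column (hypothetical
// rc_date), rather than by the raw per-row timestamp.
spark.sql("""
  CREATE TABLE mydb.partition_table_new (
    id INT,
    rc_timestamp TIMESTAMP
  )
  PARTITIONED BY (rc_date STRING)
""")

// Dynamic partition inserts require nonstrict mode.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Copy the old data across, deriving the partition value from the timestamp.
spark.sql("""
  INSERT INTO mydb.partition_table_new PARTITION (rc_date)
  SELECT id, rc_timestamp, date_format(rc_timestamp, 'yyyy-MM-dd') AS rc_date
  FROM mydb.partition_table
""")
```

The INSERT here lists the partition column last in the SELECT, which is how dynamic partition inserts map values to partitions.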

ALTER TABLE mydb.partition_table ADD IF NOT EXISTS PARTITION (rc_timestamp = '$a') only adds new partitions to the metastore for an existing partitioned Hive table. For example, say you have a table T1 which is partitioned on the column year. If you want to make the metastore aware of "year=2018", then this command is what you use.
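For the year example above, assuming the hypothetical table T1 partitioned on year and that data for 2018 already sits under the corresponding partition directory, the call would be something like:

```scala
// Register the existing year=2018 directory with the metastore.
// T1 and the year column are the hypothetical example from the answer above.
spark.sql("ALTER TABLE T1 ADD IF NOT EXISTS PARTITION (year = 2018)")
```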
