
Create Hive table on top of data created in Spark

I have created data in ORC format under Spark like this:

var select: String = "SELECT ..."
sqlContext.sql(select).write.format("orc").save("/tmp/out/20160101")
select = "SELECT ..."
sqlContext.sql(select).write.format("orc").save("/tmp/out/20160102")
// ... and so on for each date

Now I am trying to create an external table in Hive as follows:

CREATE EXTERNAL TABLE `mydb.mytable`
 (`col1` string, 
  `col2` decimal(38,0), 
  `create_date` timestamp, 
  `update_date` timestamp)
  PARTITIONED BY (`my_date` string)
  STORED AS ORC
  LOCATION '/tmp/out/';

When I do:

"select count(*) from mydb.mytable"

I get a count of 0. But in the Spark shell, when I run:

val results = sqlContext.read.format("orc").load("/tmp/out/*/part*")
results.registerTempTable("results")
sqlContext.sql("select count(*) from results").show

I get 500,000 rows as expected.

It seems like the partitions are not being recognized. How can I create an external Hive table on top of data created in Spark?

Hive will not automatically find new partitions. You need to update the Hive table after creating a new partition. Once the partition is created and added to the Hive table, you can add and remove files within that partition as you like, and those changes will be reflected immediately without needing to update the metastore.

You can use an ALTER TABLE query to create a new partition in the metastore.

ALTER TABLE mydb.mytable 
ADD PARTITION (my_date='20160101')
LOCATION '/tmp/out/20160101';

You will need to run this query for every output directory so that Hive picks them all up.
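
If you have many directories, it may be easier to issue these statements from the Spark shell rather than by hand. Below is a minimal sketch, assuming sqlContext is a HiveContext and that the dates list is filled in to match your actual output directories:

// Hypothetical list of dates; replace with the dates you actually wrote out
val dates = Seq("20160101", "20160102")

// Register each /tmp/out/<date> directory as a partition in the metastore
dates.foreach { d =>
  sqlContext.sql(
    s"ALTER TABLE mydb.mytable ADD IF NOT EXISTS PARTITION (my_date='$d') " +
    s"LOCATION '/tmp/out/$d'")
}

// Quick check that the partitions were registered
sqlContext.sql("SHOW PARTITIONS mydb.mytable").show()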

However, Hive has a standard naming convention for its partition directories: <column_name>=<value>. Using this naming scheme has a couple of advantages. Firstly, you can omit the LOCATION clause from the ALTER TABLE query. Secondly, you can instead use a different query, MSCK REPAIR TABLE <table_name>, which adds all directories under the table location as partitions in the metastore. This is useful when you want to add many partitions at once, and it means you don't need to know the values of all the partition columns you are adding.
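
As a sketch of that approach, assuming you can control the output paths: write each day's data to a directory named after the partition column, then repair the table once (again assuming sqlContext is a HiveContext):

// Write each day's output under a Hive-style <column>=<value> directory
sqlContext.sql(select).write.format("orc")
  .save("/tmp/out/my_date=20160101")

// One command then registers every my_date=<value> directory as a partition
sqlContext.sql("MSCK REPAIR TABLE mydb.mytable")

Alternatively, if my_date is a column of the query result, DataFrameWriter's partitionBy("my_date") will lay out the <column>=<value> directories for you when you save to /tmp/out/.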
