How can I create a table with only some specific files (wildcard) using Amazon Athena?

My bucket used to have this structure:

mybucket/raw/i1.json
mybucket/raw/i2.json

It was easy and straightforward to create the table in Amazon Athena using the code below.

CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients (
  `id_number` string,
  `txt` string,
   ...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');

Now I'm facing some problems with a migration of the bucket structure.

The new structure of the bucket is shown below.

mybucket/raw/1/i1.json
mybucket/raw/1/docs/doc_1.json
mybucket/raw/1/docs/doc_2.json
mybucket/raw/1/docs/doc_3.json
mybucket/raw/2/i2.json
mybucket/raw/2/docs/doc_1.json
mybucket/raw/2/docs/doc_2.json

I would now like to create two tables: the same table I had before the migration, and a new one containing only the docs. Is there any way to do that without having to rearrange my files into another folder? I'm looking for some kind of wildcard for the bucket files when creating the table.

CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients (
  `id_number` string,
  `txt` string,
   ...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = 'i*.json'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');

CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients_docs (
  `dt` date,
  `txt` string,
  `id_number` string,
  `s3_doc_path` string,
  `s3_doc_path_origin` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = 'doc_*.json'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');

I was looking for the same thing. Unfortunately, this is not possible because the S3 API is not wildcard friendly (it requires scanning all the keys on the client side, which is slow). The Athena documentation also states that this is not supported.
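To illustrate what the client-side scan amounts to: S3 lists keys by prefix, so a wildcard on the file name has to be applied after every key under the prefix has been listed. Below is a minimal sketch in Python, assuming the keys have already been listed (here they are hard-coded from the layout in the question). `filter_keys` is a hypothetical helper written for this example, not part of any AWS SDK, and the patterns are shell-style globs rather than regular expressions.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Keys copied from the bucket layout in the question (bucket name omitted).
# In practice this list would come from listing everything under "raw/".
KEYS = [
    "raw/1/i1.json",
    "raw/1/docs/doc_1.json",
    "raw/1/docs/doc_2.json",
    "raw/1/docs/doc_3.json",
    "raw/2/i2.json",
    "raw/2/docs/doc_1.json",
    "raw/2/docs/doc_2.json",
]


def filter_keys(keys, pattern):
    """Return the keys whose file name matches a shell-style pattern.

    Hypothetical helper: every key is inspected, which is why this
    approach gets slow on buckets with many objects.
    """
    return [key for key in keys if fnmatchcase(basename(key), pattern)]


# The two groups the question wants to expose as separate tables.
client_keys = filter_keys(KEYS, "i*.json")
doc_keys = filter_keys(KEYS, "doc_*.json")

print(client_keys)
print(doc_keys)
```

The cost grows with the total number of keys under the prefix, not with the number of matches, which is the slowness mentioned above.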
