
How can I create a table with only some specific files (wildcard) using Amazon Athena?

My bucket used to have this structure:

mybucket/raw/i1.json
mybucket/raw/i2.json

It was easy and straightforward to create the table in Amazon Athena with the code below.

CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients (
  `id_number` string,
  `txt` string,
   ...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');

Now I'm facing some problems with a migration in the bucket structure.

The new bucket structure is shown below.

mybucket/raw/1/i1.json
mybucket/raw/1/docs/doc_1.json
mybucket/raw/1/docs/doc_2.json
mybucket/raw/1/docs/doc_3.json
mybucket/raw/2/i2.json
mybucket/raw/2/docs/doc_1.json
mybucket/raw/2/docs/doc_2.json

I would now like to create two tables: the same table I had before the migration, and a new one containing only the docs. Is there any way to do that without rearranging my files into another folder? I'm looking for some kind of wildcard on the bucket files when creating the table, something like this:

CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients (
  `id_number` string,
  `txt` string,
   ...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = 'i*.json'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');

CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients_docs (
  `dt` date,
  `txt` string,
  `id_number` string,
  `s3_doc_path` string,
  `s3_doc_path_origin` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = 'doc_*.json'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');

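I was looking for the same thing. Unfortunately this is not possible, because the S3 API is not wildcard friendly: it can only filter a listing by key prefix, so matching a pattern such as i*.json means listing all the keys and filtering them client side, which is slow. The Athena documentation also states that wildcards are not supported in LOCATION; a table reads every file under its location, including subfolders. Note also that 'input.regex' is a RegexSerDe property that describes how to parse each row, so it does not filter file names.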
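
To make the "scanning all the keys client side" point concrete, here is a minimal Python sketch of what that filtering would look like. It is only an illustration, not something Athena does: the key list is hard-coded from the layout in the question (in practice it would come from listing the bucket, for example with boto3's list_objects_v2 paginator), and the names CLIENT_FILE, DOC_FILE and select_keys are hypothetical. The patterns are written as real regular expressions, since i*.json is a glob and would mean "zero or more i" as a regex.

```python
import re

# Object keys copied from the bucket layout in the question
# (an S3 key does not include the bucket name).
KEYS = [
    "raw/1/i1.json",
    "raw/1/docs/doc_1.json",
    "raw/1/docs/doc_2.json",
    "raw/1/docs/doc_3.json",
    "raw/2/i2.json",
    "raw/2/docs/doc_1.json",
    "raw/2/docs/doc_2.json",
]

# Hypothetical patterns: one for the client files, one for the docs.
CLIENT_FILE = re.compile(r"^raw/[^/]+/i[^/]*\.json$")
DOC_FILE = re.compile(r"^raw/[^/]+/docs/doc_[^/]*\.json$")


def select_keys(keys, pattern):
    """Return the keys whose full path matches the compiled pattern."""
    return [key for key in keys if pattern.match(key)]


client_keys = select_keys(KEYS, CLIENT_FILE)
doc_keys = select_keys(KEYS, DOC_FILE)

print(client_keys)
print(doc_keys)
```

Every key has to be fetched before it can be tested against the pattern, which is why this approach gets slow on large buckets and why a table location is limited to a plain prefix.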

