
Create table in Athena using all objects from multiple folders in S3 Bucket via Boto3

My S3 bucket has multiple sub-directories that store data for multiple websites by day, for example bucket/2020-01-03/website1, which is where the CSVs are stored. I am able to create a table based on each of the objects, but I want to create one consolidated table for all sub-directories/objects/data stored under the prefix bucket/2020-01-03, for all websites as well as all other dates.

I used the code below to create one table for a single sub-directory:

Athena configuration

import boto3

athena = boto3.client(
    'athena',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    region_name='us-west-2'
)
s3_input = 's3://bucket/2020-01-03/website1'

database = 'database1'
table = 'consolidated_table'

Athena database and table definition

create_table = \
    """CREATE EXTERNAL TABLE IF NOT EXISTS `%s`.`%s` (
        `website_id` string COMMENT 'from deserializer',
        `user` string COMMENT 'from deserializer',
        `action` string COMMENT 'from deserializer',
        `date` string COMMENT 'from deserializer'
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
        'escapeChar'='\\"', 'separatorChar'=','
    ) LOCATION '%s'
    TBLPROPERTIES (
        'skip.header.line.count'='1'
    );""" % (database, table, s3_input)

athena.start_query_execution(
    QueryString=create_table,
    WorkGroup='user_group',
    QueryExecutionContext={'Database': 'database1'},
    ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'}
)
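Note that start_query_execution is asynchronous: it returns immediately with a QueryExecutionId and does not wait for the DDL to finish. A minimal sketch of polling for completion with get_query_execution (the wait_for_query helper name is my own, not part of boto3):

```python
import time

def wait_for_query(athena, query_execution_id, poll_seconds=1.0):
    """Poll Athena until the query leaves the QUEUED/RUNNING states.

    Returns the terminal state: 'SUCCEEDED', 'FAILED', or 'CANCELLED'.
    """
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = response['QueryExecution']['Status']['State']
        if state not in ('QUEUED', 'RUNNING'):
            return state
        time.sleep(poll_seconds)
```

Usage with the client above would look like:

```python
result = athena.start_query_execution(QueryString=create_table, ...)
state = wait_for_query(athena, result['QueryExecutionId'])
assert state == 'SUCCEEDED'
```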

I also want to overwrite this table with new data from S3 every time I run it.

You can have a consolidated table for files from different "directories" on S3 only if all of them adhere to the same data schema. As I can see from your CREATE EXTERNAL TABLE, each file contains four columns: website_id, user, action and date. So you can simply change LOCATION to point to the root of your S3 "directory structure":

CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table` (
    `website_id` string COMMENT 'from deserializer', 
    `user` string COMMENT 'from deserializer', 
    `action` string COMMENT 'from deserializer', 
    `date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'escapeChar'='\\"', 'separatorChar'=','
) 
LOCATION 's3://bucket' -- instead of restricting it to s3://bucket/2020-01-03/website1
TBLPROPERTIES (
    'skip.header.line.count'='1'
);

In this case, each Athena query will scan all files under the s3://bucket location, and you can use website_id and date in a WHERE clause to filter results. However, if you have a lot of data, you should consider partitioning. It will save you not only query execution time but also money (see this post).
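For reference, a partitioned version might look like the sketch below. The dt and website partition column names are my own, and it assumes the existing bucket/&lt;date&gt;/&lt;website&gt;/ layout; because those keys are not in Hive style (e.g. dt=2020-01-03/), each partition has to be registered explicitly:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table_partitioned` (
    `website_id` string,
    `user` string,
    `action` string,
    `date` string
)
PARTITIONED BY (`dt` string, `website` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'escapeChar'='\\"', 'separatorChar'=','
)
LOCATION 's3://bucket/'
TBLPROPERTIES ('skip.header.line.count'='1');

-- Register one partition per date/website sub-directory:
ALTER TABLE `database1`.`consolidated_table_partitioned`
ADD IF NOT EXISTS PARTITION (dt='2020-01-03', website='website1')
LOCATION 's3://bucket/2020-01-03/website1/';
```

Queries that filter on dt and website then scan only the matching sub-directories instead of the whole bucket.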

I also want to over-write this table with new data from S3 every time I run it.

I assume you mean that every time you run an Athena query, it should scan the files on S3 even if they were added after you executed CREATE EXTERNAL TABLE. Note that CREATE EXTERNAL TABLE simply defines meta information about your data, i.e. where it is located on S3, its columns, etc. Thus, a query against a table with LOCATION 's3://bucket' (without partitioning) will always include all of your S3 files.
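In other words, with the unpartitioned table no "overwrite" step is needed: new files dropped anywhere under s3://bucket are picked up automatically on the next query, and filtering stays in the WHERE clause, for example:

```sql
SELECT `user`, `action`
FROM `database1`.`consolidated_table`
WHERE `website_id` = 'website1'
  AND `date` = '2020-01-03';
```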

