My S3 Bucket has multiple sub-directories that store data for multiple websites based on the day. example: bucket/2020-01-03/website 1 and within this are where the csv's are stored. I am able to create tables based on each of the objects but I want to create one consolidated table for all sub-directories/objects/data stored within the prefix bucket/2020-01-03 for all websites as well as all other dates.
I used the code below to create one table for
athena = boto3.client('athena',aws_access_key_id=ACCESS_KEY,aws_secret_access_key=SECRET_KEY,
region_name= 'us-west-2')
s3_input = 's3://bucket/2020-01-03/website1'
database = 'database1'
table = 'consolidated_table'
create_table = \
"""CREATE EXTERNAL TABLE IF NOT EXISTS `%s.%s` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
) LOCATION '%s'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1576774420');""" % ( database, table, s3_input )
athena.start_query_execution(QueryString=create_table,
WorkGroup = 'user_group',
QueryExecutionContext={'Database': 'database1'},
ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})
I also want to over-write this table with new data from S3 everytime I run it.
You can have a consolidated table for the files from different "directories" on S3 only if all of them adhere the same data schema. As I can see from your CREATE EXTERNAL TABLE
, each file contains 4 columns website_id
, user
, action
and date
. So you can simply change LOCATION
to point to the root of your S3 "directory structure"
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
)
LOCATION 's3://bucket' -- instead of restricting it to s3://bucket/2020-01-03/website1
TBLPROPERTIES (
'skip.header.line.count'='1'
);
In this case, each Athena query would scan all files under s3://bucket
location and you can use website_id
and date
in WHERE
clause to filter results. However, if you have a lot of data you should consider partitioning . It will save you not only time to execute query but also money (see this post )
I also want to over-write this table with new data from S3 every time I run it.
I assume you mean, that every time you run Athena query, it should scan files on S3 even if they were added after you executed CREATE EXTERNAL TABLE
. Note, that CREATE EXTERNAL TABLE
simply defines a meta information about you data, ie where it is located on S3, columns etc. Thus, query against table with LOCATION 's3://bucket'
(w/o partitioning) will always include all your S3 files
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.