

Moving daily snapshot data from AWS S3 to AWS Athena

I have daily snapshot data that gets dumped into an AWS S3 bucket. Each day's data file is built on top of the previous day's file. I want to move this incremental data into AWS Athena without duplicating the previous day's data.

I learned that AWS Glue can be a handy tool for moving data into Athena on a daily basis, but I am not sure how I can do it without duplicates.

When a table is defined in Amazon Athena (or AWS Glue), a location is provided that tells Athena where to look in Amazon S3 for the data.
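
As a rough illustration of what "providing a location" looks like, here is a minimal sketch that builds the DDL statement for an external table over comma-delimited files. The database, table, column names and bucket are hypothetical, and the helper build_create_table_ddl is made up for this example; the resulting statement would still have to be run in Athena yourself.

```python
def build_create_table_ddl(database, table, columns, location):
    """Return Athena DDL for an external table over CSV files at `location`."""
    if not location.startswith("s3://"):
        raise ValueError("location must be an s3:// URI")
    # Point at a folder-style prefix, so make sure it ends with a slash.
    if not location.endswith("/"):
        location += "/"
    column_sql = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


# Hypothetical names, for illustration only.
ddl = build_create_table_ddl(
    "analytics",
    "daily_snapshot",
    [("id", "string"), ("amount", "double"), ("updated_at", "string")],
    "s3://example-bucket/snapshots",
)
print(ddl)
```

Nothing is copied when such a statement runs; the table is only a pointer to the S3 prefix.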

This is conceptually different from a traditional database, where data needs to be 'loaded'. No data needs to be 'loaded' or 'moved into' Amazon Athena; it simply looks in the specified location and uses whatever data files it sees in that location (and any subdirectories).

If you are producing incremental files each day, then you can simply add additional files in that S3 location (making sure the filenames do not clash with existing files). Then, when a query is next run in Amazon Athena, those files will be included in the data that is scanned.
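
One simple way to keep incremental filenames from clashing is to put the date in the object key. The sketch below assumes one file per day; the prefix, the snapshot_ naming pattern and the incremental_key helper are hypothetical.

```python
from datetime import date


def incremental_key(prefix, day, extension="csv"):
    """Build an S3 object key that is unique per calendar day."""
    return f"{prefix.strip('/')}/snapshot_{day.isoformat()}.{extension}"


# Two consecutive days produce two distinct keys under the same prefix.
keys = [incremental_key("snapshots/", date(2024, 1, d)) for d in (14, 15)]
print(keys)
# ['snapshots/snapshot_2024-01-14.csv', 'snapshots/snapshot_2024-01-15.csv']
```

Each day's file then sits next to the earlier ones under the table's location, and Athena scans all of them.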

However, if you are producing a daily file with all data, then simply replace the previous file with the new file. Athena will use whatever file is in that location when running a query.
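
For the full-snapshot case, replacing the file can be as simple as uploading to the same key every day. The following is a sketch, not a definitive implementation: the bucket, key and replace_snapshot helper are hypothetical, and FakeS3Client is an in-memory stand-in so the example runs without AWS credentials. With real AWS access you would pass boto3.client("s3"), whose upload_file method accepts the same Filename, Bucket and Key arguments.

```python
def replace_snapshot(s3_client, local_path, bucket, key="snapshots/current.csv"):
    """Upload today's full snapshot to a fixed key, overwriting yesterday's object."""
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return key


class FakeS3Client:
    """In-memory stand-in for an S3 client (illustration only)."""

    def __init__(self):
        self.objects = {}

    def upload_file(self, Filename, Bucket, Key):
        # Writing to an existing key replaces the previous entry.
        self.objects[(Bucket, Key)] = Filename


fake = FakeS3Client()
replace_snapshot(fake, "snapshot_2024-01-14.csv", "example-bucket")
replace_snapshot(fake, "snapshot_2024-01-15.csv", "example-bucket")
print(fake.objects)
# {('example-bucket', 'snapshots/current.csv'): 'snapshot_2024-01-15.csv'}
```

Because the key never changes, only the latest snapshot is present at the table's location, so the previous day's rows are not counted twice.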
