
Athena best practice for storing query results

I am creating a data lake and have some tables in the Glue Catalog that I need to query in Athena. As a prerequisite, Athena requires us to store the query results in an S3 bucket. I have "Temp" and "Logs" S3 buckets. But since this is sensitive client data, I want to check whether I should create a new Athena bucket for this or use the existing temp/logs buckets.

Note: I don't have any future use for the Athena queries.

That's a good point: the output of Amazon Athena queries will appear in the output files, including any sensitive data.

You could create a bucket that only permits write access. That is, put a Deny policy on it so that nobody can GetObject from the bucket. That way, Athena is happy to write its output, but people can't see the results.
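As a rough illustration of the Deny policy described above, here is a minimal sketch that builds such a bucket policy as a Python dictionary. The bucket name `example-athena-results` and the function name are assumptions made for this example, not something from the original setup.

```python
import json


def build_write_only_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies s3:GetObject to every principal.

    The bucket name is supplied by the caller; nothing here talks to AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyReadOfAthenaResults",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                # The /* suffix targets the objects, not the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # "example-athena-results" is a hypothetical bucket name.
    print(json.dumps(build_write_only_policy("example-athena-results"), indent=2))
```

The printed JSON could be pasted into the bucket's policy editor. Because an explicit Deny on `"Principal": "*"` applies to every identity, it is worth running a real query against the bucket first to confirm the behaviour is what you expect.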

You could also apply an Amazon S3 lifecycle policy that deletes the files after one day.
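A one-day expiry rule might look like the sketch below. The rule ID, the empty prefix (meaning the whole bucket), and both function names are assumptions for illustration; `put_bucket_lifecycle_configuration` is the boto3 S3 client call that applies the configuration.

```python
def build_expiry_rule(days: int = 1, prefix: str = "") -> dict:
    """Build a lifecycle configuration that expires objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": "expire-athena-results",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},   # "" matches every object
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiry_rule(bucket_name: str, days: int = 1) -> None:
    """Apply the rule to a bucket. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=build_expiry_rule(days),
    )
```

Lifecycle expiration runs asynchronously, so objects are removed some time after they pass the one-day mark rather than at an exact moment.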

An alternative method would be to trigger an AWS Lambda function as soon as the object is created and have the Lambda function delete the object.
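A handler for that Lambda approach could be sketched as follows, assuming the function is subscribed to the bucket's ObjectCreated notifications and its role allows `s3:DeleteObject`. The optional `s3_client` parameter is an addition for this example so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context, s3_client=None):
    """Delete every object named in an S3 ObjectCreated event."""
    if s3_client is None:
        import boto3  # available in the Lambda Python runtime

        s3_client = boto3.client("s3")

    deleted = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_client.delete_object(Bucket=bucket, Key=key)
        deleted += 1
    return {"deleted": deleted}
```

Compared with the lifecycle rule, this removes the files within seconds instead of about a day, at the cost of one more component to deploy and monitor.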

Either way, ask people to direct their Athena output to that bucket if they don't need to access the results, or if sensitive data is being retrieved.

I would add that Athena also keeps a query history, which might contain sensitive data such as PII if it appears in your query.

Assuming the following data, DDL, and queries:

Data:

breed_id,breed_name,category
1,pug,toy
2,German Shepard,working
3,Scottish Terrier,working

DDL:

CREATE EXTERNAL TABLE default.dogs (
  `breed_id` int, 
  `breed_name` string, 
  `category` string
)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
LOCATION
  's3://stack-exchange/48836509'
TBLPROPERTIES ('skip.header.line.count'='1')

Queries:

SELECT * FROM default.dogs WHERE breed_name = 'pug'
SELECT * FROM default.dogs WHERE breed_name = 'German Shepard'

We can see these in the console:

[Screenshot: query history in the Athena console]

Based on the documentation, I believe this history is stored for 45 days.
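To check what the history currently exposes without opening the console, something like the following sketch could be used. It relies on the boto3 Athena client calls `list_query_executions` and `batch_get_query_execution`; the function name and the injectable `athena_client` parameter are assumptions for this example.

```python
def recent_query_strings(athena_client=None, max_queries: int = 50) -> list:
    """Return the SQL text of recent Athena queries, literals included."""
    if athena_client is None:
        import boto3  # requires AWS credentials with Athena read access

        athena_client = boto3.client("athena")

    # Both calls handle at most 50 query IDs per request.
    page_size = max(1, min(max_queries, 50))
    listing = athena_client.list_query_executions(MaxResults=page_size)
    ids = listing.get("QueryExecutionIds", [])
    if not ids:
        return []

    details = athena_client.batch_get_query_execution(QueryExecutionIds=ids)
    return [q["Query"] for q in details.get("QueryExecutions", [])]
```

With the two example queries above, the returned list would include the full statements, so values such as 'pug' or 'German Shepard' in a WHERE clause remain readable to anyone allowed to call these APIs.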
