
AWS Athena: cross account write of CTAS query result

I have a big historical dataset in account A. This dataset is in CSV format and partitioned by year/month/day/hour/. My goal is to convert this data to Parquet, with additional normalisation steps and an extra level of partitioning, e.g. year/month/day/hour/product/, and write it back to the same bucket of account A under the processed/ "directory". So the "directory" tree would look like

S3_bucket_Account_A

dataset
├── raw
│   ├── year=2017
│   │   ├── month=01
│   │   │   ├── day=01
│   │   │   │   ├── hour=00
│   │   │   │   └── hour=01
│
├── processed
│   ├── year=2017
│   │   ├── month=01
│   │   │   ├── day=01
│   │   │   │   ├── hour=00
│   │   │   │   │   ├── product=A
│   │   │   │   │   └── product=B
│   │   │   │   ├── hour=01
│   │   │   │   │   ├── product=A
│   │   │   │   │   └── product=B

In order to do that, I am sending CTAS query statements to Athena with the boto3 API. I am aware of the limitations of CTAS queries, e.g. a single query can write into at most 100 partitions, and the location of the CTAS query result must be empty/unique. So, I process one raw partition at a time, and the content of the CTAS query is generated on the fly with those limitations taken into consideration.
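
For reference, a minimal sketch of how this submission loop can look with boto3. The staging bucket name, database name and polling interval here are placeholder assumptions, not my actual values:

import time
import boto3

athena = boto3.client("athena")  # the session must carry account B credentials

def run_ctas(ctas_statement: str) -> str:
    """Submit one dynamically generated CTAS statement and block until it finishes."""
    query_id = athena.start_query_execution(
        QueryString=ctas_statement,
        QueryExecutionContext={"Database": "some_database_in_account_B"},
        # Staging location for query metadata, required even though the CTAS
        # data itself goes to the external_location given in the statement.
        ResultConfiguration={"OutputLocation": "s3://__athena_staging_bucket__/"},
    )["QueryExecutionId"]

    while True:  # poll until Athena reports a terminal state
        response = athena.get_query_execution(QueryExecutionId=query_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(5)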

I am using account B to execute these CTAS queries, but the result of these queries should be written into an S3 bucket owned by account A. I have been given the following permissions, specified at the bucket policy level of account A:

{
    "Effect": "Allow",
    "Principal": {
        "AWS": "__ARN_OF_ACCOUNT_B__"
    },
    "Action": [
        "s3:*"
    ],
    "Resource": [
        "arn:aws:s3:::dataset",
        "arn:aws:s3:::dataset/*"
    ]
}

The problem is that account A (the bucket owner) doesn't have access to the files that have been written as a result of the CTAS query executed by Athena in account B.

As I understand it, there is the option of account A creating an IAM role for me, so that I would perform this task as if I were account A. But unfortunately, this option is out of the question.

I have found ways to transfer ownership / change the ACL of S3 objects. One way would be to output the CTAS query result into an S3 bucket of account B and then copy these files to the bucket of account A (original source):

aws s3 cp s3://source_awsexamplebucket/ s3://destination_awsexamplebucket/ --acl bucket-owner-full-control --recursive

Another way is to recursively update the ACL with something like (original source):

aws s3 ls s3://bucket/path/ --recursive | awk '{cmd="aws s3api put-object-acl --acl bucket-owner-full-control --bucket bucket --key "$4; system(cmd)}'

But these two options would require additional GET and PUT requests to S3, and thus more money paid to AWS. More importantly, after the CTAS query succeeds I update the AWS Glue table (destination table) of account A with the partitions from the created table. This way, IAM users in account A can start querying the transformed data straight away. Here is the general idea of how I update destination_table:

import boto3

# Glue client with account B credentials; the cross-account calls below assume
# the catalog permissions allow them.
glue_client = boto3.client("glue")

# Read the partition metadata of the temporary CTAS table from account B's catalog.
# get_partitions returns at most 1000 partitions per call; use a paginator if a
# single CTAS table can exceed that.
response = glue_client.get_partitions(
    CatalogId="__ACCOUNT_B_ID__",
    DatabaseName="some_database_in_account_B",
    TableName="ctas_table"
)

# Drop the keys that are not valid in a PartitionInput structure.
for partition in response["Partitions"]:
    for key in ["DatabaseName", "TableName", "CreationTime"]:
        partition.pop(key)

# Register the same partitions on the destination table in account A's catalog.
# Note that batch_create_partition accepts at most 100 partitions per call.
glue_client.batch_create_partition(
    CatalogId="__ACCOUNT_A_ID__",
    DatabaseName="some_database_in_account_A",
    TableName="destination_table",
    PartitionInputList=response["Partitions"]
)

I do it this way instead of MSCK REPAIR TABLE destination_table because the latter takes a long time for some reason. So as you can see, if I opt for the use of aws s3 cp, I would also need to take that into account when I copy the meta information about partitions.

So my real question is: how can I grant full control to the bucket owner within a CTAS query executed by another account?

Update 2019-06-25:

Just found a similar post, but it seems that they use an IAM role, which is not an option in my case.

Update 2019-06-27:

I found out that it is not possible to change the ACL within a CTAS query. Instead, an S3 object can be copied onto itself (thanks to the comments from John Rotenstein and Theo) with new ownership.

Update 2019-06-30:

Just to recap: I run the CTAS query from account B, but the result is saved in a bucket owned by account A. This is how the CTAS query "header" looks:

CREATE TABLE some_database_in_account_B.ctas_table
WITH (
  format = 'PARQUET',
  external_location = 's3://__destination_bucket_in_Account_A__/__CTAS_prefix__/',
  partitioned_by = ARRAY['year', 'month', 'day', 'hour', 'product']
) AS (
    ...
    ...
)

Since I use boto3 to submit CTAS queries and I know __destination_bucket_in_Account_A__ together with __CTAS_prefix__, then instead of copying files onto themselves with aws cp, I can directly change their ACL within the same Python script upon successful execution of the CTAS query:

import boto3

# The session must carry account B credentials, since account B owns the new objects.
aws_session = boto3.session.Session()

s3_resource = aws_session.resource('s3')
destination_bucket = s3_resource.Bucket(name="__destination_bucket_in_Account_A__")

# objects.filter() paginates transparently, so this handles any number of files.
for obj in destination_bucket.objects.filter(Prefix="__CTAS_prefix__"):
    object_acl = s3_resource.ObjectAcl(destination_bucket.name, obj.key)
    object_acl.put(
        ACL='bucket-owner-full-control'
    )

Note: since I need to submit a number of CTAS queries that exceeds the limitations of AWS Athena, I have already implemented logic that automatically submits new queries and performs some additional tasks, e.g. updating the destination Glue table and logging. Therefore, including these lines of code is quite straightforward.

I would recommend that you perform the copy.

The "additional GET and PUT requests" would be minor: “其他GET和PUT请求”将是次要的:

  • GET is $0.0004 per 1,000 requests
  • PUT is $0.005 per 1,000 requests

Alternatively, you can run an aws s3 cp --recursive command from account B to copy the files to themselves (yes!) with a change of ownership. (The copy also needs some other change, such as setting metadata, for S3 to accept it as a copy command.) This is similar to what you were proposing with put-object-acl.

Currently, the only way to do this cleanly is to use an IAM role in account A with a trust policy that allows account B to assume the role. You mention that this is not possible in your case, which is unfortunate. The reason it's currently not possible any other way is that Athena will not write files with the "bucket-owner-full-control" ACL, so account A will never fully own any files created by an action initiated by a role in account B.

Since the policy you have been granted on the destination bucket permits everything, one thing you can do is run a task after the CTAS operation finishes that lists the objects created and copies each one onto itself (same source and destination keys) with the "bucket-owner-full-control" ACL option. Copying an object like this is a common way to change the storage and ACL properties of S3 objects. This will, as you say, incur additional charges, but they will be minuscule in comparison to the CTAS charges and the charges related to future queries against the data.
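
For illustration, a minimal sketch of such a post-CTAS task with boto3, reusing the bucket and prefix placeholders from the question. copy_object needs MetadataDirective="REPLACE" here so that S3 accepts a copy whose source and destination keys are identical:

import boto3

s3 = boto3.client("s3")  # account B credentials, since account B owns the objects

bucket = "__destination_bucket_in_Account_A__"
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="__CTAS_prefix__"):
    for obj in page.get("Contents", []):
        # Copy each object onto itself, handing full control to the bucket owner.
        s3.copy_object(
            Bucket=bucket,
            Key=obj["Key"],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            MetadataDirective="REPLACE",
            ACL="bucket-owner-full-control",
        )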

The real downside is having to write something that runs after the CTAS operation, and coordinating that. I suggest looking at Step Functions for this; you can build quite nice workflows that automate Athena and cost very little to run. I have applications that do more or less exactly what you're trying to do, using Step Functions, Lambda, and Athena, and they cost pennies (I use IAM roles for the cross-account work, though).
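
To illustrate the pattern, a state machine can alternate between two small Lambda functions: one starting the query, and one checking its status for a Choice state to branch on. This is only a sketch; the handler names and event shape are assumptions, not taken from an actual application:

import boto3

athena = boto3.client("athena")

def start_query_handler(event, context):
    # Task state: kick off the query passed in by the state machine.
    response = athena.start_query_execution(
        QueryString=event["query"],
        ResultConfiguration={"OutputLocation": event["output_location"]},
    )
    return {**event, "query_execution_id": response["QueryExecutionId"]}

def check_query_handler(event, context):
    # Task state: report the query state so a Choice state can wait, loop or continue.
    response = athena.get_query_execution(QueryExecutionId=event["query_execution_id"])
    return {**event, "state": response["QueryExecution"]["Status"]["State"]}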
