
Terraform AWS Athena to use Glue catalog as db

I'm confused as to how I should use Terraform to connect Athena to my Glue Catalog database.

I use

resource "aws_glue_catalog_database" "catalog_database" {
  name = "${var.glue_db_name}"
}

resource "aws_glue_crawler" "datalake_crawler" {
  database_name = "${var.glue_db_name}"
  name          = "${var.crawler_name}"
  role          = "${aws_iam_role.crawler_iam_role.name}"
  description   = "${var.crawler_description}"
  table_prefix  = "${var.table_prefix}"
  schedule      = "${var.schedule}"

  s3_target {
    path = "s3://${var.data_bucket_name[0]}"
  }

  s3_target {
    path = "s3://${var.data_bucket_name[1]}"
  }
}

to create a Glue DB and a crawler to crawl S3 buckets (here only two), but I don't know how to link the Athena query service to the Glue DB. In the Terraform documentation for Athena, there doesn't appear to be a way to connect Athena to a Glue catalog, only to an S3 bucket. Clearly, however, Athena can be integrated with Glue.

How can I terraform an Athena database to use my Glue catalog as its data source rather than an S3 bucket?

Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:

Crawler role and role policy:

  • The assume_role_policy of the IAM role needs only Glue as principal
  • The IAM role policy allows actions for Glue, S3, and logs
  • The Glue actions and resources can probably be narrowed down to the ones really needed
  • The S3 actions are limited to those needed by the crawler

resource "aws_iam_role" "glue_crawler_role" {
  name = "analytics_glue_crawler_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "glue_crawler_role_policy" {
  name = "analytics_glue_crawler_role_policy"
  role = "${aws_iam_role.glue_crawler_role.id}"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketAcl",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::analytics-product-data",
        "arn:aws:s3:::analytics-product-data/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    }
  ]
}
EOF
}

S3 Bucket, Glue Database and Crawler:

resource "aws_s3_bucket" "product_bucket" {
  bucket = "analytics-product-data"
  acl = "private"
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = "inventory-analytics-db"
}

resource "aws_glue_crawler" "product_crawler" {
  database_name = "${aws_glue_catalog_database.analytics_db.name}"
  name = "analytics-product-crawler"
  role = "${aws_iam_role.glue_crawler_role.arn}"

  schedule = "cron(0 0 * * ? *)"

  configuration = "{\"Version\": 1.0, \"CrawlerOutput\": { \"Partitions\": { \"AddOrUpdateBehavior\": \"InheritFromTable\" }, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\" } } }"

  schema_change_policy {
    delete_behavior = "DELETE_FROM_DATABASE"
  }

  s3_target {
    path = "s3://${aws_s3_bucket.product_bucket.bucket}/products"
  }
}
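
No separate "link" resource appears in the setup above because Athena reads table metadata from the Glue Data Catalog in the same account and region. As a rough sketch of what querying the crawled database from Terraform could look like, the following adds a workgroup with a result location and a saved query. It assumes Terraform 0.12 or later; the results bucket name, the workgroup name, and the table name `products` (the crawler chooses the actual table name) are all hypothetical.

```hcl
# Hypothetical bucket that only holds Athena query results, not source data.
resource "aws_s3_bucket" "athena_results" {
  bucket = "analytics-athena-query-results" # assumed name
}

# Workgroup that tells Athena where to write query output.
resource "aws_athena_workgroup" "analytics" {
  name = "analytics" # assumed name

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/output/"
    }
  }
}

# Saved query that runs against the Glue database populated by the crawler.
resource "aws_athena_named_query" "sample_products" {
  name      = "sample-products"
  workgroup = aws_athena_workgroup.analytics.id
  database  = aws_glue_catalog_database.analytics_db.name

  # "products" is an assumption; check the table name the crawler created.
  query = "SELECT * FROM products LIMIT 10;"
}
```

The only S3 location Athena itself needs here is the one for query output; the data location comes from the table definitions the crawler writes into the Glue database.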

I had many things wrong in my Terraform code. To start with:

  1. The S3 bucket argument in the aws_athena_database code refers to the bucket for query output, not the data the table should be built from.
  2. I had set up my aws_glue_crawler to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db.
  3. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was

    resource "aws_iam_role_policy_attachment" "crawler_attach" {
      policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
      role       = "${aws_iam_role.crawler_iam_role.name}"
    }

    After setting a second policy that explicitly allowed all S3 access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.
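
To illustrate the first point, here is a minimal sketch of an aws_athena_database resource in which the bucket argument points at a query-output bucket. It assumes Terraform 0.12 or later; the bucket name and database name are hypothetical and do not come from the original configuration.

```hcl
# Hypothetical bucket used only for Athena query output.
resource "aws_s3_bucket" "query_output" {
  bucket = "my-athena-query-output" # assumed name
}

resource "aws_athena_database" "example" {
  name = "example_db" # assumed name

  # This is where Athena stores query results,
  # not where the table data lives.
  bucket = aws_s3_bucket.query_output.bucket
}
```

The source data location is not set on this resource at all; it belongs to the table definitions, which in this setup the Glue crawler creates.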

The second policy:

resource "aws_iam_policy" "crawler_bucket_policy" {
    name = "crawler_bucket_policy"
    path = "/"
    description = "Gives crawler access to buckets"
    policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1553807998309",
      "Action": "*",
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "Stmt1553808056033",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket0"
    },
    {
      "Sid": "Stmt1553808078743",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket1"
    },
    {
      "Sid": "Stmt1553808099644",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket2"
    },
    {
      "Sid": "Stmt1553808114975",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket3"
    },
    {
      "Sid": "Stmt1553808128211",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket4"
    }
  ]
}
EOF
}

I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.
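
One possible way to avoid the hardcoded names is to build the policy from the bucket list with the aws_iam_policy_document data source. The sketch below is an untested suggestion, not the configuration that was actually used. It assumes Terraform 0.12 or later and that var.data_bucket_name is a list of bucket names, as in the crawler definition in the question. It is meant as a replacement for the policy above, and it is deliberately narrower: the first statement of that policy allows every action on every resource, and its bucket ARNs without a trailing /* do not cover the objects inside the buckets. The read-only action set here is an assumption about what the crawler needs and may have to be extended.

```hcl
# Omit this declaration if the variable is already defined elsewhere.
variable "data_bucket_name" {
  type        = list(string)
  description = "Names of the buckets the crawler should read"
}

data "aws_iam_policy_document" "crawler_bucket_access" {
  # Bucket-level permissions apply to the bucket ARNs themselves.
  statement {
    sid       = "ListCrawledBuckets"
    effect    = "Allow"
    actions   = ["s3:GetBucketLocation", "s3:ListBucket"]
    resources = formatlist("arn:aws:s3:::%s", var.data_bucket_name)
  }

  # Object-level permissions need the /* suffix on each bucket ARN.
  statement {
    sid       = "ReadCrawledObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = formatlist("arn:aws:s3:::%s/*", var.data_bucket_name)
  }
}

resource "aws_iam_policy" "crawler_bucket_policy" {
  name        = "crawler_bucket_policy"
  path        = "/"
  description = "Gives crawler access to buckets"
  policy      = data.aws_iam_policy_document.crawler_bucket_access.json
}

# Attach the generated policy to the existing crawler role.
resource "aws_iam_role_policy_attachment" "crawler_bucket_attach" {
  role       = aws_iam_role.crawler_iam_role.name
  policy_arn = aws_iam_policy.crawler_bucket_policy.arn
}
```

Adding a bucket then only requires a new entry in var.data_bucket_name; the policy statements are regenerated from the list.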
