Why won't my AWS ECS service start my task?

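I'm having trouble with a new AWS load balancer and an ECR repository, ECS cluster, and task that I created in AWS using Terraform. Everything gets created without errors. Some IAM roles and certificates live in separate files; the definitions below are the relevant ones. What happens is that the ECS service creates a task, but the task shuts down immediately after starting. I don't see any logs at all in the CloudWatch log group. In fact, the log group never even gets created.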

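It makes sense to me that none of this would work the first time I ran the infrastructure, because the ECR repository was brand new and had no Docker image pushed to it. But I have since pushed an image, and the service has never started again. I assumed it would loop indefinitely, retrying the task after each failure, but it doesn't.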

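I forced a restart by destroying the service and recreating it. Given that there is now an image to run, I expected it to work. It behaves exactly as it did on the initial launch: the service creates a task that fails to start, no reason is logged, and it never runs a task again.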

Does anyone know what is wrong here, or where I might see an error?

locals {
    container_name = "tdweb-web-server-container"
}

resource "aws_lb" "web_server" {
  name = "tdweb-alb"
  internal = false
  load_balancer_type = "application"
  security_groups = [aws_security_group.lb_sg.id]
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}

resource "aws_security_group" "lb_sg" {
  name = "ALB Security Group"
  description = "Allows TLS inbound traffic"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "TLS from VPC"
    from_port = 443
    to_port = 443
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "web_server_service" {
  name = "Web Sever Service Security Group"
  description = "Allows HTTP inbound traffic"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "HTTP from VPC"
    from_port = 80
    to_port = 80
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_alb_listener" "https" {
  load_balancer_arn = aws_lb.web_server.arn
  port = 443
  protocol = "HTTPS"
  ssl_policy = "ELBSecurityPolicy-2016-08"
  certificate_arn = aws_acm_certificate.main.arn
  
  default_action {
    target_group_arn = aws_lb_target_group.web_server.arn
    type = "forward"
  }
}

resource "random_string" "target_group_suffix" {
  length  = 4
  upper   = false
  special = false
}

resource "aws_lb_target_group" "web_server" {
  name        = "web-server-target-group-${random_string.target_group_suffix.result}"
  port        = 80
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = aws_vpc.main.id

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_iam_role" "web_server_task" {
  name = "tdweb-web-server-task-role"
  assume_role_policy = data.aws_iam_policy_document.web_server_task.json
}

data "aws_iam_policy_document" "web_server_task" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "web_server_task" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
    "arn:aws:iam::aws:policy/AWSLambdaInvocation-DynamoDB"
  ])
  role = aws_iam_role.web_server_task.name
  policy_arn = each.value
}

resource "aws_ecr_repository" "web_server" {
  name = "tdweb-web-server-repository"
}

resource "aws_ecs_cluster" "web_server" {
  name = "tdweb-web-server-cluster"
}

resource "aws_ecs_task_definition" "web_server" {
  family                   = "task_definition_name"
  task_role_arn            = aws_iam_role.web_server_task.arn
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "2048"
  requires_compatibilities = ["FARGATE"]

  container_definitions = <<-DEFINITION
    [
      {
        "name": "${local.container_name}",
        "image": "${aws_ecr_repository.web_server.repository_url}:latest",
        "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "/ecs/tdweb-task",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "ecs"
          }
        },
        "portMappings": [
          {
            "hostPort": 80,
            "protocol": "tcp",
            "containerPort": 80
          }
        ],
        "cpu": 0,
        "essential": true
      }
    ]
  DEFINITION
}

resource "aws_ecs_service" "web_server" {
  name            = "tdweb-web-server-service"
  cluster         = aws_ecs_cluster.web_server.id
  launch_type     = "FARGATE"
  task_definition = aws_ecs_task_definition.web_server.arn
  desired_count   = 1

  load_balancer {
    target_group_arn = aws_lb_target_group.web_server.arn
    container_name   = local.container_name
    container_port   = 80
  }

  network_configuration {
    subnets = [
      aws_subnet.subnet_a.id,
      aws_subnet.subnet_b.id,
      aws_subnet.subnet_c.id
    ]
    assign_public_ip = true
    security_groups  = [aws_security_group.web_server_service.id]
  }
}

Edit: To answer the comments, here are the VPC and subnets:

resource "aws_vpc" "main" {
  cidr_block = "172.31.0.0/16"
}

resource "aws_subnet" "subnet_a" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1a"
  cidr_block = "172.31.0.0/20"
}

resource "aws_subnet" "subnet_b" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1b"
  cidr_block = "172.31.16.0/20"
}

resource "aws_subnet" "subnet_c" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1c"
  cidr_block = "172.31.32.0/20"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

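Edit: Here is a somewhat illuminating update. I found this error not in the task's logs but in the container details inside the task, which I never knew were there.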

Status reason: CannotPullContainerError: Error response from daemon: Get https://563407091361.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

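It seems the service can't pull the container image from the ECR repository. After some reading, I still don't know how to fix this. I'm still looking around.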

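Based on the comments, one likely issue is that the subnets have no internet access: the VPC has an internet gateway, but no route table sends the subnets' traffic to it. This can be corrected as follows: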

# Route table to connect to Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "subnet_public_a" {
  subnet_id      = aws_subnet.subnet_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_b" {
  subnet_id      = aws_subnet.subnet_b.id
  route_table_id = aws_route_table.public.id
}


resource "aws_route_table_association" "subnet_public_c" {
  subnet_id      = aws_subnet.subnet_c.id
  route_table_id = aws_route_table.public.id
}

You can also add depends_on to your aws_ecs_service so that it waits for these route table associations to be created.
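As a minimal sketch of that suggestion, the service from the question could list the three separate associations in depends_on. The association names below assume the three-resource version shown above; every other argument is copied unchanged from the question.

```hcl
resource "aws_ecs_service" "web_server" {
  name            = "tdweb-web-server-service"
  cluster         = aws_ecs_cluster.web_server.id
  launch_type     = "FARGATE"
  task_definition = aws_ecs_task_definition.web_server.arn
  desired_count   = 1

  load_balancer {
    target_group_arn = aws_lb_target_group.web_server.arn
    container_name   = local.container_name
    container_port   = 80
  }

  network_configuration {
    subnets = [
      aws_subnet.subnet_a.id,
      aws_subnet.subnet_b.id,
      aws_subnet.subnet_c.id
    ]
    assign_public_ip = true
    security_groups  = [aws_security_group.web_server_service.id]
  }

  # Wait until every subnet has its route to the internet gateway,
  # so the first task can reach ECR to pull the image.
  depends_on = [
    aws_route_table_association.subnet_public_a,
    aws_route_table_association.subnet_public_b,
    aws_route_table_association.subnet_public_c,
  ]
}
```

If a single counted association resource is used instead, depends_on would reference that one resource rather than three.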

A shorter alternative for the associations:

locals {
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id,
  ]
}

# One association per subnet; use this instead of the three separate resources above.
resource "aws_route_table_association" "subnet_public" {
  count = length(local.subnets)

  subnet_id      = local.subnets[count.index]
  route_table_id = aws_route_table.public.id
}
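The question also mentions that the CloudWatch log group never got created. With the awslogs driver, the log group named in the task definition has to exist already unless the driver is told to create it. The sketch below assumes the group is not already defined in one of the separate files; the resource label and the retention value are illustrative choices, not part of the original configuration.

```hcl
# Assumption: no other file already manages this log group.
resource "aws_cloudwatch_log_group" "web_server" {
  # Must match "awslogs-group" in the task definition's logConfiguration.
  name = "/ecs/tdweb-task"

  # Hypothetical retention period; adjust or omit as needed.
  retention_in_days = 30
}
```

This will not fix the image pull timeout by itself, but once the task can start, its container output should appear in this group.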
