Why won't my AWS ECS service start my task?

I'm having a problem with a new AWS load balancer and an ECR repository, ECS cluster, and task that I'm creating with Terraform. Everything is created without errors. Some IAM roles and certificates are defined in a separate file; the relevant definitions are shown below. What happens is that the ECS service creates a task, but the task shuts down immediately after it starts. I don't see any logs in the CloudWatch log group at all. In fact, the log group is never even created.

It makes sense to me that this would fail the first time I apply the infrastructure, because the ECR repository is brand new and has no Docker image pushed to it. But I've since pushed the image, and the service never tries to start a task again. I would have expected it to keep retrying after a failure, but it does not.

I have forced a restart by destroying the service and then recreating it. I would expect that to work, given that there is now an image to run. Instead, it behaves the same as on the initial startup: the service creates one task, which fails to start with no logs explaining why, and then never runs a task again.

Does anyone know what's wrong with this, or where I might be able to see an error?

locals {
  container_name = "tdweb-web-server-container"
}

resource "aws_lb" "web_server" {
  name = "tdweb-alb"
  internal = false
  load_balancer_type = "application"
  security_groups = [aws_security_group.lb_sg.id]
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}

resource "aws_security_group" "lb_sg" {
  name = "ALB Security Group"
  description = "Allows TLS inbound traffic"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "TLS from VPC"
    from_port = 443
    to_port = 443
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "web_server_service" {
  name = "Web Server Service Security Group"
  description = "Allows HTTP inbound traffic"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "HTTP from VPC"
    from_port = 80
    to_port = 80
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_alb_listener" "https" {
  load_balancer_arn = aws_lb.web_server.arn
  port = 443
  protocol = "HTTPS"
  ssl_policy = "ELBSecurityPolicy-2016-08"
  certificate_arn = aws_acm_certificate.main.arn
  
  default_action {
    target_group_arn = aws_lb_target_group.web_server.arn
    type = "forward"
  }
}

resource "random_string" "target_group_suffix" {
  length  = 4
  upper   = false
  special = false
}

resource "aws_lb_target_group" "web_server" {
  name        = "web-server-target-group-${random_string.target_group_suffix.result}"
  port        = 80
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = aws_vpc.main.id

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_iam_role" "web_server_task" {
  name = "tdweb-web-server-task-role"
  assume_role_policy = data.aws_iam_policy_document.web_server_task.json
}

data "aws_iam_policy_document" "web_server_task" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "web_server_task" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
    "arn:aws:iam::aws:policy/AWSLambdaInvocation-DynamoDB"
  ])
  role = aws_iam_role.web_server_task.name
  policy_arn = each.value
}

resource "aws_ecr_repository" "web_server" {
  name = "tdweb-web-server-repository"
}

resource "aws_ecs_cluster" "web_server" {
  name = "tdweb-web-server-cluster"
}

resource "aws_ecs_task_definition" "web_server" {
  family                   = "task_definition_name"
  task_role_arn            = aws_iam_role.web_server_task.arn
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "2048"
  requires_compatibilities = ["FARGATE"]

  container_definitions = <<DEFINITION
  [
    {
      "name": "${local.container_name}",
      "image": "${aws_ecr_repository.web_server.repository_url}:latest",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/tdweb-task",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "portMappings": [
        {
          "hostPort": 80,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "cpu": 0,
      "essential": true
    }
  ]
  DEFINITION
}

resource "aws_ecs_service" "web_server" {
  name            = "tdweb-web-server-service"
  cluster         = aws_ecs_cluster.web_server.id
  launch_type     = "FARGATE"
  task_definition = aws_ecs_task_definition.web_server.arn
  desired_count   = 1

  load_balancer {
    target_group_arn = aws_lb_target_group.web_server.arn
    container_name   = local.container_name
    container_port   = 80
  }

  network_configuration {
    subnets = [
      aws_subnet.subnet_a.id,
      aws_subnet.subnet_b.id,
      aws_subnet.subnet_c.id
    ]
    assign_public_ip = true
    security_groups  = [aws_security_group.web_server_service.id]
  }
}

Edit: To answer a comment, here are the VPC and subnets:

resource "aws_vpc" "main" {
  cidr_block = "172.31.0.0/16"
}

resource "aws_subnet" "subnet_a" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1a"
  cidr_block = "172.31.0.0/20"
}

resource "aws_subnet" "subnet_b" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1b"
  cidr_block = "172.31.16.0/20"
}

resource "aws_subnet" "subnet_c" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1c"
  cidr_block = "172.31.32.0/20"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

Edit: This is a somewhat enlightening update. I found this error not in the task logs but in the container logs within the task, which I never knew were there.

Status reason: CannotPullContainerError: Error response from daemon: Get https://563407091361.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

It seems that the service cannot pull the image from the ECR repo. After doing some reading, I don't yet know how to fix this. I'm still looking around.

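Based on the comments, a likely issue is the lack of internet access in the subnets. The VPC has an internet gateway, but no route table sends traffic to it, so the task cannot reach ECR to pull the image even though it has a public IP. This can be rectified as follows: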

# Route table to connect to Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "subnet_public_a" {
  subnet_id      = aws_subnet.subnet_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_b" {
  subnet_id      = aws_subnet.subnet_b.id
  route_table_id = aws_route_table.public.id
}


resource "aws_route_table_association" "subnet_public_c" {
  subnet_id      = aws_subnet.subnet_c.id
  route_table_id = aws_route_table.public.id
}

You can also add depends_on to your aws_ecs_service so that it waits for these route table associations to be created before the service starts its first task.
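As a minimal sketch of that suggestion, the service from the question could be extended as shown here. It assumes the three named associations defined above (subnet_public_a, subnet_public_b, subnet_public_c); everything except the depends_on argument is copied from the question.

```hcl
resource "aws_ecs_service" "web_server" {
  name            = "tdweb-web-server-service"
  cluster         = aws_ecs_cluster.web_server.id
  launch_type     = "FARGATE"
  task_definition = aws_ecs_task_definition.web_server.arn
  desired_count   = 1

  load_balancer {
    target_group_arn = aws_lb_target_group.web_server.arn
    container_name   = local.container_name
    container_port   = 80
  }

  network_configuration {
    subnets = [
      aws_subnet.subnet_a.id,
      aws_subnet.subnet_b.id,
      aws_subnet.subnet_c.id,
    ]
    assign_public_ip = true
    security_groups  = [aws_security_group.web_server_service.id]
  }

  # Wait until every subnet has a route to the internet gateway,
  # so the first task can reach ECR to pull its image.
  depends_on = [
    aws_route_table_association.subnet_public_a,
    aws_route_table_association.subnet_public_b,
    aws_route_table_association.subnet_public_c,
  ]
}
```

If you use a single count-based association resource instead, reference that one resource in depends_on rather than the three named ones.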

A shorter alternative for the associations:

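locals {
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id,
  ]
}

# Replaces the three separate associations above. The resource needs its own
# name so that it does not clash with subnet_public_b.
resource "aws_route_table_association" "subnet_public" {
  count = length(local.subnets)

  subnet_id      = local.subnets[count.index]
  route_table_id = aws_route_table.public.id
}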
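Separately, the question notes that the CloudWatch log group is never created. The awslogs log driver does not create the log group unless the awslogs-create-group option is set to "true", so once the image pull succeeds the task may still need the group to exist. A minimal sketch, assuming you want Terraform to manage it: the resource label web_server and the 30-day retention are illustrative choices, while the group name matches the awslogs-group value in the task definition from the question.

```hcl
# Log group referenced by "awslogs-group" in the container definition.
# The resource label and retention period are assumptions, not from the question.
resource "aws_cloudwatch_log_group" "web_server" {
  name              = "/ecs/tdweb-task"
  retention_in_days = 30
}
```

This also assumes the execution role from the separate file (aws_iam_role.ecs_task_execution) allows logs:CreateLogStream and logs:PutLogEvents, which the AWS-managed AmazonECSTaskExecutionRolePolicy grants.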
