
How long will it take for an instance to be removed when using a capacity provider with ECS?

I have enabled a capacity provider for my ECS cluster and set the min, max, and desired capacity on the EC2 Auto Scaling group to 1, 3, and 2 respectively. I have also enabled auto-scaling for the ECS tasks with min, max, and desired counts of 2, 6, and 2 respectively.

When I deployed the whole setup using Terraform, each of the two tasks was launched on a separate instance. When the load test was run, the ECS tasks and EC2 instances successfully scaled out to 6 and 3. But after the load test completed, the ECS tasks were scaled in, yet the EC2 instances have still not been removed.

Also, does target_capacity in managed_scaling indicate the threshold used to create the auto-scaling policy for the EC2 Auto Scaling group?

 resource "aws_autoscaling_group" "asg" {
  ...
  min_size             = var.asg_min_size
  max_size             = var.asg_max_size
  desired_capacity     = var.asg_desired_capacity
  protect_from_scale_in = true
  tags = [
    {
      "key"                 = "Name"
      "value"               = local.name
      "propagate_at_launch" = true
    },
    {
      "key"                 = "AmazonECSManaged"
      "value"               = ""
      "propagate_at_launch" = true 
    }
  ]
}

resource "aws_ecs_capacity_provider" "capacity_provider" {
   name = local.name

   auto_scaling_group_provider {
      auto_scaling_group_arn         = aws_autoscaling_group.asg.arn
      managed_termination_protection = "ENABLED"

      managed_scaling {
           maximum_scaling_step_size = 4
           minimum_scaling_step_size = 1
           status                    = "ENABLED"
           target_capacity           = 70
      }
   }

  
   provisioner "local-exec" {
      when = destroy

      command = "aws ecs put-cluster-capacity-providers --cluster ${self.name} --capacity-providers [] --default-capacity-provider-strategy []"
   }
}

resource "aws_ecs_cluster" "cluster" {
  name      = local.name
  capacity_providers = [
    aws_ecs_capacity_provider.capacity_provider.name,
  ]
  tags = merge(
    {
      "Name"        = local.name,
      "Environment" = var.environment,
      "Description" = var.description,
      "Service"     = var.service,
    },
    var.tags
  )
}

You've set a target capacity of 70 for your capacity provider, so the capacity provider doesn't want to go over that utilisation.

When you have 2 tasks running on 2 different instances, you have 100% utilisation, because capacity is calculated from how many of the running instances are running non-daemon tasks. So if you have a target capacity of anything less than 100, the capacity provider won't scale in to the point where only non-empty instances are left.

If you had a target capacity of 60 and your ASG was allowed to scale to a max of 4 instances, it would still attempt to scale out to that and leave 2 empty instances, because keeping only 1 empty instance available (3 instances total) would leave the utilisation at 66.6%, which is higher than that lower target capacity.
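As a rough sketch of the arithmetic (this mirrors the CapacityProviderReservation formula from the AWS documentation, where M is the number of instances needed to run all non-daemon tasks and N is the number of instances currently running):

    CapacityProviderReservation = M / N * 100

    # 2 tasks spread over 2 instances, so M = 2:
    N = 2  ->  2 / 2 * 100 = 100.0   # above a target of 70, scale out
    N = 3  ->  2 / 3 * 100 =  66.6   # below 70: steady state, no scale-in (your situation)
    N = 4  ->  2 / 4 * 100 =  50.0   # what it takes to get at or below a target of 60

This is why your third instance is never removed: with 2 tasks on 2 instances plus 1 empty instance, the metric sits at 66.6, which is already below your target of 70, so managed scaling sees nothing to do.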

This AWS deep dive blog post on ECS capacity providers is a good read if you are just starting to use capacity providers. I think I must have read it about a dozen times when I started using them, and I still find the scaling mechanism slightly unusual. It does solve some key issues for us around naive ASG scaling based on ECS memory/CPU reservation, though. It also allows scaling to zero if you don't want any spare capacity (e.g. set a target capacity of 100, as sketched below) and don't mind waiting for instances to scale out.
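For example, to keep no spare capacity, you could change the question's config roughly like this (a minimal sketch; note that min_size must also be 0 on the ASG for true scale-to-zero):

    # In the ASG from the question:
    min_size = 0   # allow scaling all the way in when no tasks are running

    # In the capacity provider's managed_scaling block:
    managed_scaling {
      maximum_scaling_step_size = 4
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      target_capacity           = 100  # keep no empty spare instances
    }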

  1. Ensure that the capacity provider and its backing Auto Scaling group are both marked as scale-in protected, and that all running instances are marked that way as well if they were created before the Auto Scaling group's setting was changed.
  2. Decrease the size of the Auto Scaling group. This by itself will not cause any change. However, if any instances are not scale-in protected, they will get dropped immediately and the tasks running on them killed.
  3. In the cluster's ECS Instances tab, pick the instance with the lowest load and set it to drain (this can also be done from the CLI, as shown below). This initiates a gradual move of tasks off that instance and eventually its termination, as soon as it is free of running tasks.
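A sketch of the draining step from the CLI (my-cluster and the container instance ID are placeholders):

    # Mark the chosen container instance as DRAINING so ECS migrates its tasks elsewhere
    aws ecs update-container-instances-state \
      --cluster my-cluster \
      --container-instances <container-instance-id> \
      --status DRAINING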

NOTE: If your cluster was created via a template and assigned some capacity, and drained instances keep "reappearing", you may need to trace through the alarms that preserve that capacity and adjust everything to the numbers you really need.
