简体   繁体   English

向 EC2 自动扩展组添加容量的 AWS CloudWatch 警报一直处于警报状态

[英]AWS CloudWatch Alarm to add capacity to EC2 autoscaling group has been in alarm forever

I set a CloudWatch Alarm to add 1 capacity unit to EC2 autoscaling group when memory reservation is > 70%.当内存预留 > 70% 时,我设置了 CloudWatch 警报以向 EC2 自动扩展组添加 1 个容量单位。 The Alarm was triggered at the right moment, but it has since been in alarm for 16 hours+ with no change at all in the EC2 autoscaling group.警报是在正确的时刻触发的,但此后它一直处于警报状态 16 个小时以上,EC2 自动扩展组中没有任何变化。 What could possibly be going wrong?可能出什么问题了?

Here's my ECS CloudFormation template:这是我的 ECS CloudFormation 模板:

ECSCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Ref EnvironmentName

ECSAutoScalingGroup:
  DependsOn: ECSCluster
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier: !Ref Subnets
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: !Ref ClusterMinSize
    MaxSize: !Ref ClusterMaxSize
    DesiredCapacity: !Ref ClusterDesiredCapacity
  CreationPolicy:
    ResourceSignal:
      Timeout: PT15M
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 1
      MaxBatchSize: 1
      PauseTime: PT15M
      SuspendProcesses:
        - HealthCheck
        - ReplaceUnhealthy
        - AZRebalance
        - AlarmNotification
        - ScheduledActions
      WaitOnResourceSignals: true

ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '1'

MemoryReservationAlarmHigh:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '70'
    AlarmDescription: Alarm if Cluster Memory Reservation is too high
    Period: '60'
    AlarmActions:
    - Ref: ScaleUpPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: GreaterThanThreshold
    MetricName: MemoryReservation

ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '-1'

MemoryReservationAlarmLow:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '30'
    AlarmDescription: Alarm if Cluster Memory Reservation is too Low
    Period: '60'
    AlarmActions:
    - Ref: ScaleDownPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: LessThanThreshold
    MetricName: MemoryReservation

ECSLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    KeyName: !If [IsProd, !Ref 'AWS::NoValue', !Ref KeyName]
    ImageId: !Ref ECSAMI
    InstanceType: !Ref InstanceType
    SecurityGroups:
      - !Ref SecurityGroup
    IamInstanceProfile: !Ref ECSInstanceProfile
    UserData:
      "Fn::Base64": !Sub |
        #!/bin/bash
        source /etc/profile.d/proxy.sh
        yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        yum install -y https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
        yum install -y aws-cfn-bootstrap hibagent
        cat >> /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml <<EOF
        [proxy]
            http_proxy="${!http_proxy}"
            https_proxy="${!https_proxy}"
            no_proxy="${!no_proxy}"
        EOF
        /opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        /opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
        /usr/bin/enable-ec2-spot-hibernation

  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            collectd: []

        commands:
          01_add_instance_to_cluster:
            command: !Sub echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
          02_enable_cloudwatch_agent:
            command: !Sub /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:${ECSCloudWatchParameter} -s
        files:
          /etc/cfn/cfn-hup.conf:
            mode: 000400
            owner: root
            group: root
            content: !Sub |
              [main]
              stack=${AWS::StackId}
              region=${AWS::Region}

          /etc/cfn/hooks.d/cfn-auto-reloader.conf:
            content: !Sub |
              [cfn-auto-reloader-hook]
              triggers=post.update
              path=Resources.ECSLaunchConfiguration.Metadata.AWS::CloudFormation::Init
              action=/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration

        services:
          sysvinit:
            cfn-hup:
              enabled: true
              ensureRunning: true
              files:
                - /etc/cfn/cfn-hup.conf
                - /etc/cfn/hooks.d/cfn-auto-reloader.conf

# This IAM Role is attached to all of the ECS hosts. It is based on the default role
# published here:
# http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
#
# You can add other IAM policy statements here to allow access from your ECS hosts
# to other AWS services. Please note that this role will be used by ALL containers
# running on the ECS host.

ECSRole:
  Type: AWS::IAM::Role
  Properties:
    Path: /
    RoleName: !Sub ${EnvironmentName}-ECSRole-${AWS::Region}
    AssumeRolePolicyDocument: |
      {
          "Statement": [{
              "Action": "sts:AssumeRole",
              "Effect": "Allow",
              "Principal": {
                  "Service": "ec2.amazonaws.com"
              }
          }]
      }
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: ecs-service
        PolicyDocument: |
          {
              "Statement": [{
                  "Effect": "Allow",
                  "Action": [
                      "ecs:CreateCluster",
                      "ecs:DeregisterContainerInstance",
                      "ecs:DiscoverPollEndpoint",
                      "ecs:Poll",
                      "ecs:RegisterContainerInstance",
                      "ecs:StartTelemetrySession",
                      "ecs:Submit*",
                      "ecr:BatchCheckLayerAvailability",
                      "ecr:BatchGetImage",
                      "ecr:GetDownloadUrlForLayer",
                      "ecr:GetAuthorizationToken"
                  ],
                  "Resource": "*"
              }]
          }

ECSInstanceProfile:
  Type: AWS::IAM::InstanceProfile
  Properties:
    Path: /
    Roles:
      - !Ref ECSRole

ECSServiceAutoScalingRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        Action:
          - "sts:AssumeRole"
        Effect: Allow
        Principal:
          Service:
            - application-autoscaling.amazonaws.com
    Path: /
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
    Policies:
      - PolicyName: ecs-service-autoscaling
        PolicyDocument:
          Statement:
            Effect: Allow
            Action:
              - application-autoscaling:*
              - cloudwatch:DescribeAlarms
              - cloudwatch:PutMetricAlarm
              - ecs:DescribeServices
              - ecs:UpdateService
            Resource: "*"

ECSCloudWatchParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: CloudWatch Log configs for ECS cluster
    Name: !Sub AmazonCloudWatch-${ECSCluster}-ECS
    Type: String
    Value: !Sub |
      {
        "logs": {
          "force_flush_interval": 5,
          "logs_collected": {
            "files": {
              "collect_list": [
                {
                  "file_path": "/var/log/messages",
                  "log_group_name": "${ECSCluster}/var/log/messages",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%b %d %H:%M:%S"
                },
                {
                  "file_path": "/var/log/dmesg",
                  "log_group_name": "${ECSCluster}/var/log/dmesg",
                  "log_stream_name": "{instance_id}"
                },
                {
                  "file_path": "/var/log/docker",
                  "log_group_name": "${ECSCluster}/var/log/docker",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%S.%f"
                },
                {
                  "file_path": "/var/log/ecs/ecs-init.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-init.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/ecs-agent.log.*",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-agent.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/audit.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/audit.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                }
              ]
            }
          }
        },
        "metrics": {
          "append_dimensions": {
            "AutoScalingGroupName": "${!aws:AutoScalingGroupName}",
            "InstanceId": "${!aws:InstanceId}",
            "InstanceType": "${!aws:InstanceType}"
          },
          "metrics_collected": {
            "collectd": {
              "metrics_aggregation_interval": 60
            },
            "disk": {
              "measurement": [
                "used_percent"
              ],
              "metrics_collection_interval": 60,
              "resources": [
                "/"
              ]
            },
            "mem": {
              "measurement": [
                "mem_used_percent"
              ],
              "metrics_collection_interval": 60
            },
            "statsd": {
              "metrics_aggregation_interval": 60,
              "metrics_collection_interval": 10,
              "service_address": ":8125"
            }
          }
        }
      }

ECSClusterParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Cluster
    Name: !Sub /${EnvironmentName}/ecs-cluster
    Type: String
    Value: !Ref ECSCluster

ECSServiceAutoScalingRoleParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Service ASG Role
    Name: !Sub /${EnvironmentName}/ecs-service-asg-role
    Type: String
    Value: !GetAtt ECSServiceAutoScalingRole.Arn

The Alarm activity history:警报活动历史记录:

2019-12-26 11:40:54 Action  Successfully executed action arn:aws:autoscaling:ap-southeast-2:031539715286:scalingPolicy:95e836b6-2f56-498d-b931-7ec4184bedc4:autoScalingGroupName/ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9:policyName/ECS-UEBZA8GAP8S7-ScaleUpPolicy-17LUWE42DC7EO
2019-12-26 11:40:54 State update  Alarm updated from OK to In alarm

Make sure there aren't any processes suspended.确保没有任何进程暂停。 Alarm notification means that incoming alarms won't trigger scaling policies.警报通知意味着传入警报不会触发扩展策略。 Launch means even if the desired goes up nothing will be launched启动意味着即使所需的上升也不会启动

Other common issues that can cause this:可能导致此问题的其他常见问题:

  • If you're using weights and increasing desired by 1, but the lowest weigh isn't 1, then it might never be able to scale.如果您使用权重并将所需的权重增加 1,但最低权重不是 1,则它可能永远无法缩放。

  • Make sure there aren't any other scaling policies being triggered that might override this one确保没有触发任何其他可能会覆盖此策略的扩展策略

  • Check the activity history to make sure there aren't any healthcheck replacements constantly happening, since that would start a 5 minute cooldown (default since one isn't set on the ASG, only the scaling policy), and would block simple scaling policies检查活动历史以确保没有任何健康检查替换不断发生,因为这将开始 5 分钟的冷却时间(默认值,因为没有在 ASG 上设置,只有扩展策略),并且会阻止简单的扩展策略

  • Make sure the desired isn't already at the Max确保所需的尚未达到最大值

  • In addition to the alarm being triggered, make sure you see in the Alarm history that the autoscaling 'action' happened (The action actually happens every minute the alarm stays in the Alarm state, no mater what your evaluation settings, but only the first one gets posted to the Alarm history)除了触发警报之外,请确保您在警报历史记录中看到自动缩放“动作”发生(该动作实际上每分钟发生一次,警报保持警报状态,无论您的评估设置如何,但只有第一个被发布到警报历史记录中)

  • Check the ASG Activity history for launch failures, this is especially common if using spot instances, and the ASG will eventually enter a backoff state after enough failures.检查 ASG 活动历史记录是否有启动失败,这在使用 Spot 实例时尤其常见,并且 ASG 在出现足够多的失败后最终会进入退避状态。 Any manual update to the group will reset this backoff对该组的任何手动更新都将重置此退避

您是否指定了“ActionsEnabled=True”?

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM