AWS CloudWatch Alarm to add capacity to EC2 autoscaling group has been in alarm forever
I set a CloudWatch alarm to add 1 capacity unit to an EC2 autoscaling group when memory reservation is > 70%. The alarm was triggered at the right moment, but it has since been in the alarm state for 16+ hours with no change at all in the EC2 autoscaling group. What could possibly be going wrong?
Here's my ECS CloudFormation template:
ECSCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Ref EnvironmentName
ECSAutoScalingGroup:
  DependsOn: ECSCluster
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier: !Ref Subnets
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: !Ref ClusterMinSize
    MaxSize: !Ref ClusterMaxSize
    DesiredCapacity: !Ref ClusterDesiredCapacity
  CreationPolicy:
    ResourceSignal:
      Timeout: PT15M
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 1
      MaxBatchSize: 1
      PauseTime: PT15M
      SuspendProcesses:
        - HealthCheck
        - ReplaceUnhealthy
        - AZRebalance
        - AlarmNotification
        - ScheduledActions
      WaitOnResourceSignals: true
ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '1'
MemoryReservationAlarmHigh:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '70'
    AlarmDescription: Alarm if Cluster Memory Reservation is too high
    Period: '60'
    AlarmActions:
      - Ref: ScaleUpPolicy
    Namespace: AWS/ECS
    Dimensions:
      - Name: ClusterName
        Value: !Ref ECSCluster
    ComparisonOperator: GreaterThanThreshold
    MetricName: MemoryReservation
ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '-1'
MemoryReservationAlarmLow:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '30'
    AlarmDescription: Alarm if Cluster Memory Reservation is too Low
    Period: '60'
    AlarmActions:
      - Ref: ScaleDownPolicy
    Namespace: AWS/ECS
    Dimensions:
      - Name: ClusterName
        Value: !Ref ECSCluster
    ComparisonOperator: LessThanThreshold
    MetricName: MemoryReservation
ECSLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    KeyName: !If [IsProd, !Ref 'AWS::NoValue', !Ref KeyName]
    ImageId: !Ref ECSAMI
    InstanceType: !Ref InstanceType
    SecurityGroups:
      - !Ref SecurityGroup
    IamInstanceProfile: !Ref ECSInstanceProfile
    UserData:
      "Fn::Base64": !Sub |
        #!/bin/bash
        source /etc/profile.d/proxy.sh
        yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        yum install -y https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
        yum install -y aws-cfn-bootstrap hibagent
        cat >> /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml <<EOF
        [proxy]
        http_proxy="${!http_proxy}"
        https_proxy="${!https_proxy}"
        no_proxy="${!no_proxy}"
        EOF
        /opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        /opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
        /usr/bin/enable-ec2-spot-hibernation
  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            collectd: []
        commands:
          01_add_instance_to_cluster:
            command: !Sub echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
          02_enable_cloudwatch_agent:
            command: !Sub /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:${ECSCloudWatchParameter} -s
        files:
          /etc/cfn/cfn-hup.conf:
            mode: 000400
            owner: root
            group: root
            content: !Sub |
              [main]
              stack=${AWS::StackId}
              region=${AWS::Region}
          /etc/cfn/hooks.d/cfn-auto-reloader.conf:
            content: !Sub |
              [cfn-auto-reloader-hook]
              triggers=post.update
              path=Resources.ECSLaunchConfiguration.Metadata.AWS::CloudFormation::Init
              action=/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        services:
          sysvinit:
            cfn-hup:
              enabled: true
              ensureRunning: true
              files:
                - /etc/cfn/cfn-hup.conf
                - /etc/cfn/hooks.d/cfn-auto-reloader.conf
# This IAM Role is attached to all of the ECS hosts. It is based on the default role
# published here:
# http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
#
# You can add other IAM policy statements here to allow access from your ECS hosts
# to other AWS services. Please note that this role will be used by ALL containers
# running on the ECS host.
ECSRole:
  Type: AWS::IAM::Role
  Properties:
    Path: /
    RoleName: !Sub ${EnvironmentName}-ECSRole-${AWS::Region}
    AssumeRolePolicyDocument: |
      {
        "Statement": [{
          "Action": "sts:AssumeRole",
          "Effect": "Allow",
          "Principal": {
            "Service": "ec2.amazonaws.com"
          }
        }]
      }
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: ecs-service
        PolicyDocument: |
          {
            "Statement": [{
              "Effect": "Allow",
              "Action": [
                "ecs:CreateCluster",
                "ecs:DeregisterContainerInstance",
                "ecs:DiscoverPollEndpoint",
                "ecs:Poll",
                "ecs:RegisterContainerInstance",
                "ecs:StartTelemetrySession",
                "ecs:Submit*",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetAuthorizationToken"
              ],
              "Resource": "*"
            }]
          }
ECSInstanceProfile:
  Type: AWS::IAM::InstanceProfile
  Properties:
    Path: /
    Roles:
      - !Ref ECSRole
ECSServiceAutoScalingRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        Action:
          - "sts:AssumeRole"
        Effect: Allow
        Principal:
          Service:
            - application-autoscaling.amazonaws.com
    Path: /
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
    Policies:
      - PolicyName: ecs-service-autoscaling
        PolicyDocument:
          Statement:
            Effect: Allow
            Action:
              - application-autoscaling:*
              - cloudwatch:DescribeAlarms
              - cloudwatch:PutMetricAlarm
              - ecs:DescribeServices
              - ecs:UpdateService
            Resource: "*"
ECSCloudWatchParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: CloudWatch Log configs for ECS cluster
    Name: !Sub AmazonCloudWatch-${ECSCluster}-ECS
    Type: String
    Value: !Sub |
      {
        "logs": {
          "force_flush_interval": 5,
          "logs_collected": {
            "files": {
              "collect_list": [
                {
                  "file_path": "/var/log/messages",
                  "log_group_name": "${ECSCluster}/var/log/messages",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%b %d %H:%M:%S"
                },
                {
                  "file_path": "/var/log/dmesg",
                  "log_group_name": "${ECSCluster}/var/log/dmesg",
                  "log_stream_name": "{instance_id}"
                },
                {
                  "file_path": "/var/log/docker",
                  "log_group_name": "${ECSCluster}/var/log/docker",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%S.%f"
                },
                {
                  "file_path": "/var/log/ecs/ecs-init.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-init.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/ecs-agent.log.*",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-agent.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/audit.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/audit.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                }
              ]
            }
          }
        },
        "metrics": {
          "append_dimensions": {
            "AutoScalingGroupName": "${!aws:AutoScalingGroupName}",
            "InstanceId": "${!aws:InstanceId}",
            "InstanceType": "${!aws:InstanceType}"
          },
          "metrics_collected": {
            "collectd": {
              "metrics_aggregation_interval": 60
            },
            "disk": {
              "measurement": [
                "used_percent"
              ],
              "metrics_collection_interval": 60,
              "resources": [
                "/"
              ]
            },
            "mem": {
              "measurement": [
                "mem_used_percent"
              ],
              "metrics_collection_interval": 60
            },
            "statsd": {
              "metrics_aggregation_interval": 60,
              "metrics_collection_interval": 10,
              "service_address": ":8125"
            }
          }
        }
      }
ECSClusterParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Cluster
    Name: !Sub /${EnvironmentName}/ecs-cluster
    Type: String
    Value: !Ref ECSCluster
ECSServiceAutoScalingRoleParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Service ASG Role
    Name: !Sub /${EnvironmentName}/ecs-service-asg-role
    Type: String
    Value: !GetAtt ECSServiceAutoScalingRole.Arn
The Alarm activity history:
2019-12-26 11:40:54 Action Successfully executed action arn:aws:autoscaling:ap-southeast-2:031539715286:scalingPolicy:95e836b6-2f56-498d-b931-7ec4184bedc4:autoScalingGroupName/ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9:policyName/ECS-UEBZA8GAP8S7-ScaleUpPolicy-17LUWE42DC7EO
2019-12-26 11:40:54 State update Alarm updated from OK to In alarm
Make sure there aren't any processes suspended. A suspended AlarmNotification process means that incoming alarms won't trigger scaling policies. A suspended Launch process means that even if the desired capacity goes up, nothing will be launched.
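As a rough way to check this, here is a minimal Python sketch. It assumes you already have the group description as a dict in the shape that boto3's describe_auto_scaling_groups returns (one entry of its "AutoScalingGroups" list); the function name and the sample data are hypothetical, not from the question. Since the template in the question lists AlarmNotification under SuspendProcesses for rolling updates, it may be worth confirming that it is not still suspended on the live group.

```python
# Processes whose suspension would stop an alarm-driven scale-out,
# as described in the answer: AlarmNotification and Launch.
BLOCKING = {"AlarmNotification", "Launch"}


def blocking_suspended_processes(group):
    """Return the suspended processes on `group` that would block scale-out.

    `group` is assumed to be one entry of the "AutoScalingGroups" list
    from a describe_auto_scaling_groups response.
    """
    suspended = {p["ProcessName"] for p in group.get("SuspendedProcesses", [])}
    return sorted(suspended & BLOCKING)


# Hypothetical sample data for illustration only.
sample_group = {
    "AutoScalingGroupName": "example-asg",
    "SuspendedProcesses": [
        {"ProcessName": "AlarmNotification", "SuspensionReason": "User suspended"},
        {"ProcessName": "AZRebalance", "SuspensionReason": "User suspended"},
    ],
}

print(blocking_suspended_processes(sample_group))  # ['AlarmNotification']
```

An empty list would suggest that suspended processes are not the cause, and the other checks are the next thing to look at.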
Other common issues that can cause this:
If you're using weights and increasing the desired capacity by 1, but the lowest weight isn't 1, then it might never be able to scale.
Make sure there aren't any other scaling policies being triggered that might override this one.
Check the activity history to make sure there aren't any health check replacements constantly happening, since that would start a 5-minute cooldown (the default, since one isn't set on the ASG, only on the scaling policy) and would block simple scaling policies.
Make sure the desired capacity isn't already at the max.
In addition to the alarm being triggered, make sure you see in the alarm history that the autoscaling 'action' happened. (The action actually happens every minute the alarm stays in the alarm state, no matter what your evaluation settings are, but only the first one gets posted to the alarm history.)
Check the ASG activity history for launch failures. This is especially common when using Spot Instances, and the ASG will eventually enter a backoff state after enough failures. Any manual update to the group will reset this backoff.
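Two of the checks above (desired capacity already at the max, and launch failures in the activity history) can be sketched in Python. This is only an illustration: it assumes dicts in the shapes returned by boto3's describe_auto_scaling_groups and describe_scaling_activities, and the function names and sample values are hypothetical.

```python
def at_max_capacity(group):
    """True if the group cannot grow because desired capacity has reached MaxSize."""
    return group["DesiredCapacity"] >= group["MaxSize"]


def failed_activities(activities):
    """Return (description, status message) pairs for activities that failed.

    `activities` is assumed to be the "Activities" list from a
    describe_scaling_activities response.
    """
    return [
        (a.get("Description", ""), a.get("StatusMessage", ""))
        for a in activities
        if a.get("StatusCode") == "Failed"
    ]


# Hypothetical sample data for illustration only.
sample_group = {"MinSize": 1, "MaxSize": 2, "DesiredCapacity": 2}
sample_activities = [
    {"Description": "Launching a new EC2 instance", "StatusCode": "Failed",
     "StatusMessage": "example failure message"},
    {"Description": "Launching a new EC2 instance", "StatusCode": "Successful"},
]

print(at_max_capacity(sample_group))             # True
print(len(failed_activities(sample_activities)))  # 1
```

If the first check is true, the scale-up action may be reported as executed in the alarm history even though the group has no room to add an instance.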
Did you specify "ActionsEnabled=True"?
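One way to check this without opening the console is sketched below. It assumes a list of alarm dicts in the shape of the "MetricAlarms" list returned by boto3's describe_alarms; the function name and sample alarm are hypothetical. Note that the alarm history in the question already shows an executed action, so this is more of a general sanity check than a likely cause here.

```python
def alarms_that_cannot_act(alarms):
    """Return names of alarms in the ALARM state that have actions disabled
    or no alarm actions attached.

    `alarms` is assumed to be the "MetricAlarms" list from a
    describe_alarms response.
    """
    return [
        a["AlarmName"]
        for a in alarms
        if a.get("StateValue") == "ALARM"
        and (not a.get("ActionsEnabled", True) or not a.get("AlarmActions"))
    ]


# Hypothetical sample data for illustration only.
sample_alarms = [
    {"AlarmName": "example-high", "StateValue": "ALARM",
     "ActionsEnabled": False, "AlarmActions": ["arn:example"]},
    {"AlarmName": "example-low", "StateValue": "OK",
     "ActionsEnabled": True, "AlarmActions": ["arn:example"]},
]

print(alarms_that_cannot_act(sample_alarms))  # ['example-high']
```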