
How do I use ELB's HealthyHostCount for monitoring in CloudWatch?

Original 2012-07-23 08:29:10 7 2 amazon-ec2/ amazon-web-services/ metrics/ amazon-elb/ amazon-cloudwatch

We have three EC2 instances, one in each availability zone (AZ) of the eu-west-1 region. They are load-balanced using ELB. We'd like to use CloudWatch to monitor how many instances are registered with the load balancer. The problem is: I don't really understand the HealthyHostCount metric.

For a deployment, we'd like to be able to de-register a single instance (take it out of the LB) without being notified. So the alarm would be: notify if only 1 healthy instance has been left behind the load balancer for 5 minutes.

As far as I understand, HealthyHostCount (HHC) is the number of healthy instances registered with a given ELB, averaged over all AZs. If everything is okay, the HHC should be 1 (no matter over what period of time) because there is 1 instance in each AZ.

A couple of days ago, someone deployed without re-registering the instances, so only 1 instance was being balanced. When we noticed that, we created an alarm to notify us when the average HHC dropped below 0.6 over 5 minutes. (If only 1 instance is registered with the ELB, the HHC should average 0.33 for any period of time.) However, the alarm never changed to state "ALARM."

When I checked the HHC in CloudWatch, the values were numbers that didn't make sense (a sum of 10.0 for a 5-minute interval is all I remember now).

It's all a big mess to me. Any time I think I understand the metric, the CloudWatch charts turn out to be gibberish to me.

Could someone please explain how to use HHC to get an alarm when only 1 instance is registered? Is average HHC the way to go, or should I use another metric?

2 Answers

The HealthyHostCount metric records one data value, the count of available hosts, for each availability zone each time a health check is executed. Your ELB health check has an Interval parameter (the number of seconds between checks), which determines how many health checks are executed per minute.

If you are watching a Per-AZ metric with a health check Interval of 10 seconds and 2 healthy hosts in that AZ, you will see 6 data points per minute (60/10), each with a value of 2. The average, max and min will be 2, but the sum will be 6*2=12.

If you have 3 AZs with 2 hosts each, again with an Interval of 10 seconds, but you are looking at the Per-LB metric, you will see 3*6=18 data points per minute, each one with a value of 2. The average, max and min will be 2, but the sum will be 18*2=36.
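The arithmetic in the two examples above can be reproduced with a small simulation. This is only a sketch of the model described in this answer (one data point per AZ per health check); the function names are made up for illustration and nothing here calls CloudWatch.

```python
from statistics import mean


def healthy_host_datapoints(interval_s, hosts_per_az, period_s=60):
    """Simulate the HealthyHostCount values recorded during one period.

    hosts_per_az holds the healthy host count of each AZ. The model
    assumes one data point per AZ per health check, as described above.
    """
    checks = period_s // interval_s
    return [count for count in hosts_per_az for _ in range(checks)]


def summarize(points):
    """Compute the statistics CloudWatch would offer for these points."""
    return {
        "samples": len(points),
        "average": mean(points),
        "minimum": min(points),
        "maximum": max(points),
        "sum": sum(points),
    }


# Per-AZ view: one AZ with 2 healthy hosts, Interval of 10 seconds.
per_az = summarize(healthy_host_datapoints(10, [2]))
# Per-LB view: three AZs with 2 healthy hosts each, same Interval.
per_lb = summarize(healthy_host_datapoints(10, [2, 2, 2]))

print(per_az)
print(per_lb)
```

The average, minimum and maximum stay at 2 in both views, while the sum grows with the number of data points.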

I recommend that you set an interval value that divides 60 seconds evenly (5, 6, 10, 15, 20, 30 or 60 seconds), so that every one-minute period contains the same number of data points.

In your case, if your interval is 30 seconds and you have 3 AZs with 1 server per AZ, you should expect 2 data points per AZ per minute. So set up a Per-LB alarm with a Period of 1 minute on the Sum of HealthyHostCount that triggers when the value is less than or equal to 2 (2 data values * 1 healthy AZ * 1 healthy server = 2; the other 4 data values, from the unhealthy AZs, should be 0, so they won't affect the sum).
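As a rough sketch of that alarm, the snippet below computes the expected Sum for 3, 2, 1 and 0 healthy AZs and checks it against the threshold. The parameter names follow the CloudWatch PutMetricAlarm API, but the alarm name and load balancer name are placeholders, and the model assumes (as stated above) that an AZ without a healthy server reports 0 on every check.

```python
def expected_sum(healthy_azs, total_azs=3, interval_s=30, period_s=60):
    """Sum of HealthyHostCount over one period, with one server per AZ.

    Assumes an AZ without a healthy server reports 0 on each check.
    """
    checks = period_s // interval_s
    per_az_counts = [1] * healthy_azs + [0] * (total_azs - healthy_azs)
    return sum(count * checks for count in per_az_counts)


alarm_params = {
    "AlarmName": "elb-only-one-healthy-host",  # placeholder name
    "Namespace": "AWS/ELB",
    "MetricName": "HealthyHostCount",
    "Dimensions": [
        # placeholder load balancer name
        {"Name": "LoadBalancerName", "Value": "my-load-balancer"},
    ],
    "Statistic": "Sum",
    "Period": 60,
    "EvaluationPeriods": 1,
    "Threshold": 2.0,
    "ComparisonOperator": "LessThanOrEqualToThreshold",
}


def would_alarm(value, params):
    """Apply the LessThanOrEqualToThreshold comparison locally."""
    return value <= params["Threshold"]


for healthy in (3, 2, 1, 0):
    total = expected_sum(healthy)
    print(healthy, "healthy AZ(s): sum =", total,
          "alarm =", would_alarm(total, alarm_params))

# With boto3 installed and credentials configured, the same parameters
# could be submitted with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Under these assumptions the alarm stays quiet with two or three healthy AZs and fires with one or none. Raising EvaluationPeriods to 5 would be one way to approximate the five-minute condition from the question.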

UPDATE:

It turns out that the number of health checks executed also depends on the number of internal instances that make up the ELB (usually one per AZ). If you are suffering a traffic spike, or enough load to saturate a single internal ELB instance, the number of internal servers inside the ELB will grow and you will unexpectedly get more data points. This may affect the sum value, but only if you have lots of traffic. I didn't see this issue with a peak load of 6k RPM distributed across 3 AZs. If this is your scenario, then using the average is a safer bet, but I would recommend that you use "less than 0.65" as your threshold.
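To illustrate why the average is less sensitive than the sum here, the sketch below models each internal ELB node as emitting its own data points. That model, the even spread of extra nodes across AZs, and all names are assumptions made for illustration only.

```python
from statistics import mean


def datapoints(healthy_azs, total_azs=3, checks_per_az=2, elb_nodes_per_az=1):
    """Data points for one minute, with one server per AZ.

    Assumes every internal ELB node reports each check on its own, and
    that an AZ without a healthy server reports 0.
    """
    counts = [1] * healthy_azs + [0] * (total_azs - healthy_azs)
    repeats = checks_per_az * elb_nodes_per_az
    return [count for count in counts for _ in range(repeats)]


THRESHOLD = 0.65  # alarm when the average is less than this value

for nodes in (1, 2):
    for healthy in (3, 2, 1):
        points = datapoints(healthy, elb_nodes_per_az=nodes)
        avg = mean(points)
        print(f"nodes/AZ={nodes} healthy={healthy} "
              f"sum={sum(points)} avg={avg:.2f} alarm={avg < THRESHOLD}")
```

Doubling the number of internal nodes doubles every sum but leaves the averages unchanged (about 0.67 with two healthy AZs, about 0.33 with one), so a threshold of 0.65 separates the two cases under this model.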

The link also makes me wonder how the Cross-Zone Load Balancing feature affects the number of data points...

This is an area where the CloudWatch web console doesn't expose everything that CloudWatch can do. As the docs explain, HealthyHostCount is a per-availability-zone metric. The console lets you view HealthyHostCount by availability zone (but across all load balancers) or by load balancer (but across all zones), but not sliced both ways.

If you only have one load balancer, the simplest thing would be to set up one alarm on each of the per-zone metrics. If you have multiple availability zones, then you should be able to use the API to create an alarm slicing across availability zone and load balancer (again, one alarm per load balancer), but as far as I know you can't do this from the web UI.
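A minimal sketch of what slicing by both dimensions could look like through the API: one set of PutMetricAlarm parameters per zone, each carrying both the LoadBalancerName and AvailabilityZone dimensions. The load balancer name, zone names, alarm names, and the statistic, period and threshold choices are all illustrative assumptions, and the snippet only builds the parameters without calling AWS.

```python
def per_zone_alarms(load_balancer, zones, min_healthy=1):
    """Build one alarm definition per (load balancer, zone) pair."""
    alarms = []
    for zone in zones:
        alarms.append({
            "AlarmName": f"{load_balancer}-{zone}-healthy-hosts",
            "Namespace": "AWS/ELB",
            "MetricName": "HealthyHostCount",
            "Dimensions": [
                {"Name": "LoadBalancerName", "Value": load_balancer},
                {"Name": "AvailabilityZone", "Value": zone},
            ],
            "Statistic": "Average",
            "Period": 60,
            "EvaluationPeriods": 5,
            "Threshold": float(min_healthy),
            "ComparisonOperator": "LessThanThreshold",
        })
    return alarms


# Placeholder load balancer and zone names.
alarms = per_zone_alarms(
    "my-load-balancer", ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["Dimensions"])

# With boto3 installed and credentials configured, each definition
# could be submitted with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```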