Amazon EC2 Servers getting frozen sporadically
I've been working with Amazon EC2 servers for 3+ years and I have noticed a recurring behaviour: some servers freeze sporadically (between 1 and 5 times a year).
I like Amazon EC2 servers, mainly because Amazon has a lot of useful additional services (like SES), but this behaviour is really frustrating. Sometimes I get calls from customers complaining that systems are down, and all it takes to solve the problem is an instance restart.
Does anybody have a tip for solving this problem?
UPDATE 1
UPDATE 2
The logs show a CPU utilization peak near the time the server went down. It was at 3 AM. At that time a daily crontab task runs to make a database backup. But considering this task runs every day, why would it make the server freeze only sometimes?
I have not seen this exact issue, but on any cloud platform I assume any instance can fail at any time, so we design for failure. For example, we have autoscaling on all customer-facing instances. Any time an instance fails, it is automatically replaced.
If a customer is calling to tell you a server is down, you may need to consider more automated methods of monitoring instance health and taking automated action to recover the instance.
CloudWatch also has server recovery actions available that can be triggered if certain metric thresholds are reached.
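As a rough illustration of the CloudWatch approach, here is a minimal sketch in Python that builds the parameters for such an alarm and passes them to boto3's `put_metric_alarm`. The helper names, the alarm name, the two-minute evaluation window, and the placeholder instance ID are my own assumptions, not something from your setup. Note that the `recover` action responds to failed system status checks (problems on the underlying host), while the `reboot` action paired with the instance status check is probably closer to the "restart fixes it" case you describe.

```python
# Sketch only: helper names, alarm name, and thresholds are illustrative assumptions.

# Each automated EC2 action is paired with the status check metric it reacts to.
ACTION_METRICS = {
    "recover": "StatusCheckFailed_System",   # underlying host problem
    "reboot": "StatusCheckFailed_Instance",  # OS-level hang or misconfiguration
}


def build_status_alarm(instance_id: str, region: str, action: str = "reboot") -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    if action not in ACTION_METRICS:
        raise ValueError(f"unsupported action: {action}")
    return {
        "AlarmName": f"auto-{action}-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": ACTION_METRICS[action],
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # seconds per evaluation period
        "EvaluationPeriods": 2,   # assumed: act after two failing minutes in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }


def create_status_alarm(instance_id: str, region: str, action: str = "reboot") -> None:
    """Create the alarm in AWS; needs boto3 installed and credentials configured."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_status_alarm(instance_id, region, action))


if __name__ == "__main__":
    # Placeholder instance ID; this only prints the parameters and calls nothing in AWS.
    print(build_status_alarm("i-0123456789abcdef0", "us-east-1"))
```

Calling `create_status_alarm("i-...", "us-east-1")` with a real instance ID would register the alarm. Keeping the parameter builder separate lets you inspect or log what will be sent before anything is created in your account.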