
Amazon EC2 Servers getting frozen sporadically

Posted 2020-02-01 11:15:58 · Tags: amazon-web-services, amazon-ec2, ubuntu-16.04, digital-ocean

I've been working with Amazon EC2 servers for 3+ years and have noticed a recurring behaviour: some servers freeze sporadically (between 1 and 5 times per year).

When this occurs, I can't connect to the server (I tried HTTP, MySQL and SSH connections) until it is restarted.
The server works again after a restart.
Sometimes a server stays online for 6+ months; sometimes it freezes about 1 month after a restart.
All the servers on which I noticed this behaviour were micro instances (North Virginia and Sao Paulo).
The servers have an ordinary Apache 2, MySQL 5, PHP 7 environment, with Ubuntu 16 or 18. The PHP/MySQL web application is not CPU intensive and is accessed by no more than 30 users/hour.
The same environment and application on Digital Ocean servers does NOT reproduce the behaviour (I have two Digital Ocean servers that have been running uninterrupted for 2+ years).

I like Amazon EC2 servers, mainly because Amazon has a lot of useful additional services (like SES), but this behaviour is really frustrating. Sometimes I get calls from customers complaining that systems are down, and all it takes to solve the problem is an instance restart.

Does anybody have a tip for solving this problem?

UPDATE 1

They are t2.micro instances (1 GB RAM, 1 vCPU).
MySQL SHOW GLOBAL VARIABLES: pastebin.com/m65ieAAb

UPDATE 2

There is a CPU utilization peak in the logs near the time the server went down. It was at 3 AM, when a daily crontab task makes a database backup. But considering this task runs every day, why would it make the server freeze only sometimes?

1 Answer

I have not seen this exact issue, but on any cloud platform I assume any instance can fail at any time, so we design for failure. For example, we have autoscaling on all customer-facing instances. Any time an instance fails, it is automatically replaced.

If a customer is calling to tell you a server is down, you may need to consider more automated methods of monitoring instance health and taking automated action to recover the instance.
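As a rough illustration of what automated health monitoring could look like, here is a minimal sketch in Python using only the standard library. The names `http_probe` and `HealthWatcher`, the threshold of three failures, and the idea of passing in a `recover` callback are all assumptions made for this example; they are not part of any AWS tooling.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts and HTTP error codes.
        return False


class HealthWatcher:
    """Counts consecutive probe failures and triggers a recovery callback."""

    def __init__(
        self,
        probe: Callable[[], bool],
        recover: Callable[[], None],
        max_failures: int = 3,
    ) -> None:
        self.probe = probe
        self.recover = recover
        self.max_failures = max_failures
        self.failures = 0

    def check(self) -> bool:
        """Run one probe; call recover() after max_failures failures in a row."""
        if self.probe():
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= self.max_failures:
            self.recover()
            # Reset so a failed recovery is retried after another full window.
            self.failures = 0
        return False
```

A watcher like this would have to run somewhere other than the instance it watches (for example, from cron on a second machine), with `probe` set to something like `lambda: http_probe("http://example.com/")` and `recover` set to whatever restart mechanism is available. Requiring several consecutive failures avoids restarting the server over a single slow response.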

CloudWatch also has server recovery actions available that can be triggered if certain metric thresholds are reached.
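To make this concrete, here is a hedged sketch of how such an alarm might be defined with the boto3 SDK's `put_metric_alarm` call. The helper names `build_recovery_alarm` and `create_alarm`, the alarm name format, the instance ID in the usage note, and the period and threshold values are illustrative assumptions, not recommendations from the answer. The status check metrics and the `arn:aws:automate:<region>:ec2:recover` / `ec2:reboot` action ARNs are the ones CloudWatch provides for EC2 instances.

```python
def build_recovery_alarm(instance_id: str, region: str, action: str = "reboot") -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    action="recover" watches the system status check (host-level problems);
    action="reboot" watches the instance status check (guest OS problems).
    """
    metrics = {
        "recover": "StatusCheckFailed_System",
        "reboot": "StatusCheckFailed_Instance",
    }
    if action not in metrics:
        raise ValueError(f"unsupported action: {action!r}")
    return {
        "AlarmName": f"auto-{action}-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": metrics[action],
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per evaluation window (assumed value)
        "EvaluationPeriods": 3,  # consecutive failing windows before acting (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }


def create_alarm(params: dict, region: str) -> None:
    """Send the alarm definition to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # third-party AWS SDK, imported lazily so the builder stays testable

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)
```

For example, `create_alarm(build_recovery_alarm("i-0123456789abcdef0", "us-east-1"), "us-east-1")` would ask CloudWatch to reboot that (made-up) instance after three consecutive minutes of failed instance status checks. An unresponsive guest OS would most likely show up as a failed instance status check, which is why this sketch defaults to the reboot action; the recover action applies to failures of the underlying host.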