
Amazon Linux 2 worker fails to reboot

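I run a Node.js application on an Amazon Linux 2 worker instance connected to SQS.

Problem

Everything works fine, except that for technical reasons I need to restart the server regularly. To do this, I set up a cron job that runs /sbin/shutdown -r now at night.

When the instance restarts, I get an error about the SQS daemon service: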

[INFO] Executing instruction: configureSqsd
[INFO] get sqsd conf from cfn metadata and write into sqsd conf file ...
[INFO] Executing instruction: startSqsd
[INFO] Running command /bin/sh -c systemctl show -p PartOf sqsd.service
[INFO] Running command /bin/sh -c systemctl is-active sqsd.service
[INFO] Running command /bin/sh -c systemctl start sqsd.service
[ERROR] An error occurred during execution of command [self-startup] - [startSqsd]. 
Stop running the command. Error: startProcess Failure: starting process "sqsd" failed: 
Command /bin/sh -c systemctl start sqsd.service failed with error exit status 1. 
Stderr:Job for sqsd.service failed because the control process exited with error code. 
See "systemctl status sqsd.service" and "journalctl -xe" for details.

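The instance then gets stuck in a loop: initialization runs until it hits the sqsd.service error, then starts over.

Logs

The systemctl status sqsd.service command does not seem to show anything beyond what we already have, only that the process exited with status 1: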

● sqsd.service - This is sqsd daemon
   Loaded: loaded (/etc/systemd/system/sqsd.service; enabled; vendor preset: disabled)
   Active: deactivating (stop-sigterm) (Result: exit-code)
  Process: 2748 ExecStopPost=/bin/sh -c  (code=exited, status=0/SUCCESS)
  Process: 2745 ExecStopPost=/bin/sh -c rm -f /var/pids/sqsd.pid (code=exited, status=0/SUCCESS)
  Process: 2753 ExecStart=/bin/sh -c /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start (code=exited, status=1/FAILURE)
   CGroup: /system.slice/sqsd.service
           └─2789 /opt/elasticbeanstalk/lib/ruby/bin/ruby /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start

The most interesting thing I found when checking journalctl -xe was:

sqsd[9704]: /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `initialize': No such file or directory @ rb_sysopen - /var/run/aws-sqsd/default.pid (Errno::ENOENT)
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `open'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `start'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:83:in `launch'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:111:in `<top (required)>'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `load'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `<main>'
systemd[1]: sqsd.service: control process exited, code=exited status=1
systemd[1]: Failed to start This is sqsd daemon.
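
To see which file the daemon fails to open on a given boot, the ENOENT lines can be filtered out of the journal. This is a minimal sketch, assuming the error lines keep the format shown above; missing_paths is a hypothetical helper name and is not part of aws-sqsd.

```shell
#!/bin/sh
# missing_paths (hypothetical helper): reads log lines on stdin and prints
# each distinct path that Ruby reported as "No such file or directory".
missing_paths() {
    sed -n 's/.*No such file or directory @ rb_sysopen - \(.*\) (Errno::ENOENT).*/\1/p' | sort -u
}

# Example usage on the instance (current boot only, sqsd unit only):
#   journalctl -u sqsd.service -b --no-pager | missing_paths
```

Restricting journalctl to the sqsd unit and the current boot keeps output from earlier boots out of the result.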

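Further investigation

According to the logs, the file /var/run/aws-sqsd/default.pid does not exist when the server is restarted. It does exist after a rebuild, and it then contains the application's process ID.

If I add the file, the setup process gets further, until a similar file turns out to be missing.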
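
On Amazon Linux 2, /var/run normally points at /run, a tmpfs that is emptied on every boot. That would be consistent with the PID file being present after a rebuild and gone after a reboot. The sketch below is a diagnostic aid, not a confirmed fix: it only recreates the parent directory of a PID file before the service is started. ensure_pid_dir is a hypothetical helper, and whether the daemon needs specific ownership or permissions on that directory is not known from the logs above.

```shell
#!/bin/sh
# ensure_pid_dir PIDFILE (hypothetical helper)
# Creates the parent directory of PIDFILE if it does not exist yet and
# reports what it did. It does not create the PID file itself.
ensure_pid_dir() {
    pid_dir=$(dirname "$1")
    if [ -d "$pid_dir" ]; then
        echo "exists $pid_dir"
    else
        mkdir -p "$pid_dir" || return 1
        echo "created $pid_dir"
    fi
}

# Example usage as root, with the path taken from the journalctl output:
#   ensure_pid_dir /var/run/aws-sqsd/default.pid
#   systemctl start sqsd.service
```

If the service then fails on another missing file, the same check can be repeated with the new path to see whether all of them live under a directory that is cleared at boot.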

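Solution?

Has anyone run into this problem? I don't know why starting sqsd.service fails after a normal reboot but works fine after the initial deployment and after rebuilding the environment... It almost looks as if it is searching for a configuration file that doesn't exist...

Is there any other way to safely reboot the instance that I should try?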

I have the same problem. I'm not posting a solution, just more data about the issue. I found errors in /var/log/messages indicating that the SQSd daemon ran out of memory.

Apr 28 15:43:05 ip-172-31-121-3 sqsd: /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:42:in `fork': Cannot allocate memory - fork(2) (Errno::ENOMEM)
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:42:in `start'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:83:in `launch'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:111:in `<top (required)>'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `load'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `<main>'
Apr 28 15:43:05 ip-172-31-121-3 systemd: sqsd.service: control process exited, code=exited status=1
Apr 28 15:43:05 ip-172-31-121-3 systemd: Failed to start This is sqsd daemon.
Apr 28 15:43:05 ip-172-31-121-3 systemd: Unit sqsd.service entered failed state.
Apr 28 15:43:05 ip-172-31-121-3 systemd: sqsd.service failed.

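After switching to a larger instance class everything went fine, but I'm not sure whether that was due to the extra memory or just to the instance being fresh (as david.emilsson mentioned).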
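
One way to tell the two explanations apart is to check, before resizing, whether the out-of-memory error actually appears on the affected instance. The following is a rough sketch, assuming the log lines keep the format shown above; count_sqsd_enomem is a hypothetical helper name.

```shell
#!/bin/sh
# count_sqsd_enomem LOGFILE (hypothetical helper)
# Prints how many sqsd lines in LOGFILE report "Cannot allocate memory".
count_sqsd_enomem() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # grep -c prints 0 and exits 1 when nothing matches, so mask the status.
    grep -c 'sqsd: .*Cannot allocate memory' "$1" || true
}

# Example usage on the instance (reading /var/log/messages usually needs root):
#   count_sqsd_enomem /var/log/messages
#   free -m    # current memory and swap usage, in MiB
```

A count of zero on an instance that still fails to start sqsd would suggest that memory is not the cause there.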

