How to detect an issue in the watchdog daemon on Linux (Debian) before the watchdog reboots the OS

Original post: 2016-08-20 14:50:00 · Tags: c, linux, debian, systems-programming

I am working on an application project on Debian Linux that uses the software watchdog to monitor other services through the PID files those services create.

I am following the steps from http://linux.die.net/man/5/watchdog.conf and installed it with

apt-get install watchdog

The mechanism behind it is that the watchdog checks for the existence of the PID files configured in the /etc/watchdog.conf file.
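To make the PID-file mechanism concrete, here is a minimal sketch of the kind of check that each pidfile entry in /etc/watchdog.conf implies: read the PID from the file and ask the kernel whether that process still exists. It is an illustration only, not the daemon's actual code; the function names are hypothetical.

```python
import os


def pid_from_file(path):
    """Return the PID stored in a PID file, or None if the file is missing or invalid."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def service_alive(path):
    """Return True if the PID file names a process that currently exists."""
    pid = pid_from_file(path)
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 delivers nothing; it only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A stale PID file (one left behind by a crashed service) fails this check in the same way as a missing file, which is why stopping a service is enough to trigger the watchdog.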

I tested it by stopping a service with service service-name stop

The watchdog detects that the service is no longer running and reboots the system after a number of seconds equal to the watchdog timeout period.

Consider a product with no display: if, for example, a service's configuration files are corrupted, the watchdog would reboot the system endlessly without any notification to the end user.

In practice, I want to know the watchdog's status before it takes action (reboot, halt, or soft restart), so that the programmer can implement notification logic for the end user.

Otherwise, is it possible to modify the watchdog init script in /etc/init.d/ to call a user program when the software watchdog stops, so that the programmer can maintain a counter in non-volatile memory and avoid an endless reboot loop?
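The counter idea can be sketched independently of how it gets invoked. The following is a minimal illustration, assuming a small program that runs early at every boot and keeps its count in a file on persistent storage; the function names, the file path, and the limit are all hypothetical, and this does not show how to hook it into the watchdog init script.

```python
import os
import tempfile


def read_count(path):
    """Return the stored boot count, or 0 if the file is missing or corrupt."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def write_count(path, value):
    """Write the count atomically so a reset mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(str(value))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def record_boot(path, limit):
    """Increment the counter; return True while the count is within the limit."""
    count = read_count(path) + 1
    write_count(path, count)
    return count <= limit


def clear_count(path):
    """Reset the counter once the system has stayed healthy for a while."""
    write_count(path, 0)
```

When record_boot returns False, the caller could leave the watchdog disabled or switch to a maintenance mode instead of rebooting again; clear_count would be called after the services have run without failure for some chosen period.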

Beyond the above, I would like to know more about how to get status from this software watchdog (the watchdog daemon). I have set it up to monitor services, CPU overload, temperature, etc., but I do not receive any event before the watchdog acts, so I cannot tell why the system restarted: a stopped service, CPU overheating, CPU overload, or something else.

1 Answer

A watchdog is designed as a last resort to rescue a system after it has failed beyond recovery. A hardware watchdog will physically reset the CPU, and is used to make sure that a system doesn't hang for long periods.

There is no way to receive a warning in software that this is about to happen, because it's assumed that all software has failed.

If you need a solution that detects that a process is no longer responding, you should make that separate from the watchdog.
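As a rough sketch of what "separate from the watchdog" could look like: a small monitor loop that polls its own health checks and calls a notification hook after a few consecutive failures. If it polls faster than the watchdog timeout, the notification (or a service restart) happens before any reset. Every name and parameter here is hypothetical; the rounds and sleep arguments exist only so the loop can be bounded and tested.

```python
import time


def monitor(checks, notify, interval=1.0, max_failures=3, rounds=None, sleep=time.sleep):
    """Poll health checks and notify once per failure streak.

    checks: dict mapping a service name to a callable that returns True when healthy.
    notify: callable(name, failure_count), invoked when a streak reaches max_failures.
    rounds: number of polling rounds to run, or None to run forever.
    """
    failures = {name: 0 for name in checks}
    done = 0
    while rounds is None or done < rounds:
        for name, check in checks.items():
            if check():
                failures[name] = 0  # healthy again: reset the streak
                continue
            failures[name] += 1
            if failures[name] == max_failures:
                notify(name, failures[name])
        done += 1
        sleep(interval)
    return failures
```

The notify hook is where user-facing logic would go, for example blinking an LED on a display-less device, writing the failing service's name to persistent storage, or attempting a restart of that service.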

See the answers to this question for something similar: Designing a monitor process for monitoring and restarting processes