Kubernetes reaches 100% CPU on a node, but not on the pods

Question

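My Kubernetes cluster (running 1.18) runs into a problem every day: one of the nodes goes above 100% CPU usage, and Kubernetes can no longer connect external visitors to my pods. (Basically, the website goes down.)

The strange thing is that the pods always sit at a comfortable 30% CPU (or lower), so the application itself seems fine.

When I describe the problematic node, I see mentions of node-problem-detector timeouts.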

Events:
  Type     Reason                  Age                      From                                     Message
  ----     ------                  ---                      ----                                     -------
  Normal   NodeNotSchedulable      10m                      kubelet                                  Node nodepoo1-vmss000007 status is now: NodeNotSchedulable
  Warning  KubeletIsDown           9m44s (x63 over 5h21m)   kubelet-custom-plugin-monitor            Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s"
  Warning  ContainerRuntimeIsDown  9m41s (x238 over 5h25m)  container-runtime-custom-plugin-monitor  Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_runtime.s"

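My current approach is to run three nodes in my node pool and effectively babysit Kubernetes: I watch for outages, cordon the problematic node, and move all of its pods to one of the other nodes. After 15 minutes everything is back to normal, I uncordon the affected node, and the cycle starts again.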
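
The manual cycle described above (cordon, move the pods, wait, uncordon) could be scripted roughly as follows. This is a minimal sketch, assuming kubectl is on the PATH and already configured for the cluster. The helper names (evacuation_commands, restore_commands, run) are hypothetical, and the node name is the one from the events output above. It only automates the workaround; it does not address the cause of the CPU spikes.

```python
import subprocess
from typing import List


def evacuation_commands(node: str) -> List[List[str]]:
    """Commands that stop new scheduling on a node and evict its pods."""
    return [
        # Mark the node unschedulable so no new pods land on it.
        ["kubectl", "cordon", node],
        # Evict the pods; drain also cordons, so the step above is only
        # there to mirror the manual procedure. --ignore-daemonsets is
        # needed because drain refuses to proceed past DaemonSet pods.
        ["kubectl", "drain", node, "--ignore-daemonsets"],
    ]


def restore_commands(node: str) -> List[List[str]]:
    """Command that makes the node schedulable again."""
    return [["kubectl", "uncordon", node]]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    node = "nodepoo1-vmss000007"  # node name taken from the events above
    run(evacuation_commands(node))  # dry run: only prints the commands
    run(restore_commands(node))
```

With dry_run left at its default, the script only prints the commands, which makes it safe to review before letting it touch the cluster.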

This weekend I was particularly unlucky: I had three CPU spikes within 24 hours.

How can I go about fixing this? Any information on Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s" would help.

Answer 1

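You can try opening an ssh connection to the node and then using top to check which process is consuming the CPU. If that process runs in a pod and crictl is installed on the node, you can use https://github.com/k8s-school/pid2pod to retrieve the pod that is running it.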
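
To make that step more concrete, here is a small sketch that ranks processes by CPU from ps output and tries to pull a pod UID out of a process's cgroup path. It is an illustration only and not a description of how pid2pod works internally. It assumes a Linux node with the procps version of ps, and cgroup paths that contain a pod<uid> segment under kubepods, which depends on the cgroup driver in use. All function names are hypothetical.

```python
import re
import subprocess
from typing import List, Optional, Tuple

# Pod UID as it appears in cgroup paths: dashes with the cgroupfs driver,
# underscores with the systemd driver (assumed layouts).
POD_UID = re.compile(
    r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})"
)


def parse_ps(output: str, limit: int = 5) -> List[Tuple[int, float, str]]:
    """Parse `ps -eo pid,pcpu,comm` output, highest CPU first."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, comm = line.split(None, 2)
        rows.append((int(pid), float(pcpu), comm))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def pod_uid_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the pod UID found in /proc/<pid>/cgroup content, if any."""
    if "kubepods" not in cgroup_text:
        return None  # process does not belong to a pod cgroup
    match = POD_UID.search(cgroup_text)
    return match.group(1).replace("_", "-") if match else None


if __name__ == "__main__":
    # Note: ps reports CPU averaged over the process lifetime, whereas
    # top shows recent usage, so the two rankings can differ.
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, pcpu, comm in parse_ps(out):
        try:
            with open(f"/proc/{pid}/cgroup") as handle:
                uid = pod_uid_from_cgroup(handle.read())
        except OSError:
            uid = None  # process exited or is not readable
        print(pid, pcpu, comm, uid or "no pod cgroup (host process?)")
```

If the busiest process turns out not to belong to any pod, that would be consistent with the pods reporting low CPU while the node itself is saturated.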

Answer 2

Try looking at your periodSeconds and timeoutSeconds specs. Your answer must be hiding in those specs.
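
For context, periodSeconds and timeoutSeconds are fields of a container probe (liveness or readiness), alongside failureThreshold. The sketch below shows roughly how they interact. It assumes the documented Kubernetes defaults (period 10 s, timeout 1 s, failure threshold 3); the Probe class and its method names are hypothetical, and the timing is a rough estimate rather than an exact model of the kubelet.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    # Fields mirror periodSeconds, timeoutSeconds and failureThreshold;
    # the values are the documented Kubernetes defaults.
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3

    def attempt_fails(self, response_seconds: float) -> bool:
        """A single probe attempt fails if the reply exceeds the timeout."""
        return response_seconds > self.timeout_seconds

    def worst_case_detection_seconds(self) -> int:
        """Rough upper bound from the onset of trouble until the probe is
        considered failed: failure_threshold consecutive attempts, one per
        period, plus the timeout of the last attempt."""
        return self.period_seconds * self.failure_threshold + self.timeout_seconds


if __name__ == "__main__":
    default = Probe()
    # On a CPU-starved node a reply taking 1.5 s already fails the
    # default 1 s timeout, even though the application is healthy.
    print(default.attempt_fails(1.5))
    print(default.worst_case_detection_seconds())
    # A more tolerant timeout accepts the same slow reply.
    print(Probe(timeout_seconds=5).attempt_fails(1.5))
```

The point of the sketch is that a short timeout on a saturated node can turn slow but healthy pods into failing ones, so the probe settings are worth checking alongside the node's CPU usage.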

Answer 1
1 2022-02-19 09:18:03

Answer 2
0 2022-02-23 11:18:58
