Kubernetes Pod Terminated - Exit Code 137

Question

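I need some advice on an issue I am facing with k8s 1.14 while running gitlab pipelines on it. Many jobs are failing with exit code 137, which I found means the container was terminated abruptly.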

Cluster information:

Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: C5.4xLarge

After digging in, I found the following logs:

**kubelet: I0114 03:37:08.639450**  4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).

**kubelet: E0114 03:37:08.653132**  4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes

**kubelet: W0114 03:37:23.240990**  4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up

**kubelet: W0114 00:15:51.106881**   4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage

**kubelet: I0114 00:15:51.106907**   4781 container_gc.go:85] attempting to delete unused containers

**kubelet: I0114 00:15:51.116286**   4781 image_gc_manager.go:317] attempting to delete unused images

**kubelet: I0114 00:15:51.130499**   4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage 

**kubelet: I0114 00:15:51.130648**   4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:

 1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
 2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
 3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
 4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
 5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)

The pods then get terminated, which results in exit code 137.

Can anyone help me understand the cause and a possible solution?

Thanks :)

Answer 1

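Exit code 137 does not necessarily mean OOMKilled. It means the container failed because it received SIGKILL (from some interrupt, or from the oom-killer [out of memory]).

If the pod was OOMKilled, you will see the lines below when you describe the pod: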

      State:        Terminated
      Reason:       OOMKilled

I have seen a similar error but was unable to find the root cause; in our case the reason was reported simply as: Error
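The check described in this answer can also be done programmatically. Below is a minimal Python sketch, assuming the pod object has already been obtained as JSON (for example from `kubectl get pod <name> -o json`) and parsed into a dict. The function name `termination_reasons` and the sample pod are hypothetical and only for illustration; the field names (`status.reason`, `containerStatuses[].state.terminated`, `lastState.terminated`) are those of the Kubernetes pod status.

```python
def termination_reasons(pod: dict) -> list:
    """Collect termination details from a pod object (parsed JSON)."""
    status = pod.get("status", {})
    findings = []
    # Pod-level reason, e.g. "Evicted" when the kubelet evicts the pod.
    if status.get("reason"):
        findings.append(f"pod: {status['reason']}")
    for cs in status.get("containerStatuses", []):
        # A container may report termination in its current or previous state.
        for key in ("state", "lastState"):
            term = cs.get(key, {}).get("terminated")
            if term:
                findings.append(
                    f"{cs['name']} ({key}): "
                    f"reason={term.get('reason')} exitCode={term.get('exitCode')}"
                )
    return findings


# Hypothetical pod status, for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "build",
             "state": {"terminated": {"reason": "Error", "exitCode": 137}}}
        ]
    }
}
print(termination_reasons(sample_pod))
```

A reason of `OOMKilled` points at memory, a pod-level `Evicted` points at node pressure such as the ephemeral-storage eviction in the question's logs, and a bare `Error` with exit code 137 only says that something sent SIGKILL.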

Answer 2

Typical causes for this error code are the system running out of memory or a failed health check.

Answer 3

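I was able to solve the problem.

The nodes initially had a 20G EBS volume and the c5.4xlarge instance type. I increased the EBS volume to 50G and then 100G, but that did not help, as I kept seeing the following error:

"Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%)."

I then changed the instance type to c5d.4xlarge, which has 400GB of cache storage, and gave it 300GB of EBS. This resolved the error.

Some of the gitlab jobs were for Java applications that consume a lot of cache space and write a lot of logs.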
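The kubelet message quoted in this answer reduces to simple arithmetic: once usage of the image filesystem reaches the high threshold (85%), the kubelet tries to free enough bytes to bring usage back down to the low threshold (80%). A minimal Python sketch of that calculation follows. The function name and the sample capacity and free-space figures are hypothetical, and the integer rounding is an assumption, not a copy of the kubelet source.

```python
def image_gc_bytes_to_free(capacity: int, available: int,
                           high: int = 85, low: int = 80):
    """Return (usage_percent, bytes_to_free) for an image filesystem.

    capacity and available are in bytes; high and low are the
    thresholds (in percent) taken from the log message.
    """
    usage_percent = 100 - (available * 100) // capacity
    if usage_percent < high:
        # Below the high threshold: nothing needs to be reclaimed.
        return usage_percent, 0
    # Free space required for usage to sit at the low threshold.
    target_available = capacity * (100 - low) // 100
    return usage_percent, target_available - available


# Hypothetical figures: a 20 GB filesystem with 1 GB still free.
usage, to_free = image_gc_bytes_to_free(20_000_000_000, 1_000_000_000)
print(usage, to_free)  # 95 3000000000
```

With these made-up numbers the sketch reports 95% usage and 3,000,000,000 bytes to reclaim, close to the figures in the log. When unused images cannot cover that amount, as in the "freed 0 bytes" line in the question, image garbage collection alone cannot relieve the pressure, which fits with this answer resolving the problem by adding storage.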

Answer 4

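137 means that k8s killed the container for some reason (possibly it did not pass a liveness probe).

Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.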

Answer 5

Check the memory and CPU profile of the Jenkins master node. In my case, the master had high memory and CPU utilization, and the slaves were restarting with 137.

Answer 6

I ran into this problem, and it was not because of OOM but because the AWS spot instance had been interrupted by AWS.

Answer 7

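Exit code 137 in detail

It indicates that the process was terminated by an external signal.
The number 137 is the sum of two numbers, 128 + x, where x is the number of the signal that was sent to the process and caused it to terminate.
In this example, x equals 9, which is the number of the SIGKILL signal, meaning the process was forcibly killed.

Hope this helps.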
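The 128 + x rule can be checked with a few lines of Python. This is a small sketch using the standard `signal` module; the helper name `decode_exit_code` is made up for illustration, and the signal numbers are those of the platform the code runs on (Linux is assumed here, where 9 is SIGKILL).

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with this number on the current platform.
            name = f"unknown signal {signum}"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


print(decode_exit_code(137))  # killed by signal 9 (SIGKILL)
```

The code only tells you which signal arrived, not who sent it, so the pod's reason field and the kubelet logs are still needed to tell an OOM kill from an eviction or a failed probe.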

Answer 1: score 18, 2020-01-16 06:20:38
Answer 2: score 4, 2020-10-13 01:30:12
Answer 3: score 3, accepted, 2020-01-16 06:12:54
Answer 4: score 1, 2020-10-23 07:12:51
Answer 5: score 0, 2021-08-19 10:45:32
Answer 6: score 0, 2021-12-06 09:41:22
Answer 7: score 0, 2022-09-23 14:06:33
