
Troubleshooting intermittent Docker failures when applying cgroup configuration for a process

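Let me try to describe my situation here, capturing whatever information I have.

We have a production-grade service made up of a number of Docker containers, each hosting one of several services, all running in a cloud (Azure) VM.

When we keep it running for a long time as part of longevity testing (5 days or more), we sometimes see the services start to fail and refuse to serve our clients. It does not happen on every run, and not always right at the 5-day mark.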

ERROR: for health-checker  Cannot start service health-checker: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"failed to write 1 to memory.kmem.limit_in_bytes: write /sys/fs/cgroup/memory/docker/ad4926b8e5b583ce3ae30d4e3d1f1379ee89fc2735d83a87b127ef4e1e7089db/memory.kmem.limit_in_bytes: cannot allocate memory\"": unknown {}

ERROR: for credentials  Cannot start service credentials: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"failed to write 1 to memory.kmem.limit_in_bytes: write /sys/fs/cgroup/memory/docker/5b2cef0997776af7265fcc41bd640059a29fc723375e43acde63514f58ec6055/memory.kmem.limit_in_bytes: cannot allocate memory\"": unknown {}

ERROR: for occm  Cannot start service occm: runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"failed to write 1 to memory.kmem.limit_in_bytes: write /sys/fs/cgroup/memory/docker/9d5912c7459a514c6f9bdaa3a170b1bf0ba4fa3189b482b72c2013a85cf5b8ba/memory.kmem.limit_in_bytes: cannot allocate memory\"": unknown {}

failed to perform container upgrade task. java.lang.RuntimeException: Failed to deploy containers {akkaAddress=akka://some-manager, akkaSource=akka://some-manager/user/service-deployer, sourceActorSystem=some-manager}
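Because the failure is intermittent, it can help to tally which services hit this cgroup error over a multi-day run. The following is a minimal sketch, assuming the log lines have exactly the shape quoted above; the names `parse_cgroup_error` and `tally_failures` are hypothetical helpers, not part of Docker or any library, and the pattern may need adjusting for other Docker versions.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Pattern derived only from the error lines quoted in the question (assumption:
# other Docker/runc versions may word the message differently).
_CGROUP_ERROR = re.compile(
    r"ERROR: for (?P<service>\S+)\s+Cannot start service \S+:.*?"
    r"write (?P<path>/sys/fs/cgroup/\S+?/(?P<container>[0-9a-f]{64})/(?P<file>[\w.]+)): "
    r'(?P<reason>[^\\"]+)'
)


def parse_cgroup_error(line: str) -> Optional[dict]:
    """Return service, path, container, file and reason, or None if the line does not match."""
    match = _CGROUP_ERROR.search(line)
    return match.groupdict() if match else None


def tally_failures(lines: Iterable[str]) -> Counter:
    """Count matching cgroup errors per service name."""
    counts: Counter = Counter()
    for line in lines:
        info = parse_cgroup_error(line)
        if info is not None:
            counts[info["service"]] += 1
    return counts
```

Feeding the deployer's log through `tally_failures` shows whether every service fails at once or only some of them, which narrows down where to look next.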

As a result, none of our services is reachable and all HTTPS calls are rejected:

Name does not resolve {}\n","stream":"stdout","time":"2021-07-02T03:38:29.720361925Z"}

Name does not resolve {}\n","stream":"stdout","time":"2021-07-02T03:38:29.744298675Z"}

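I have searched a lot and tried to work out where to start so that I can do something actionable and meaningful.

Any pointers, insights, or clues would be much appreciated.

(I know I may not be describing the problem in much detail or very precisely. I am honestly somewhat stuck, because it only fails sometimes, after about 5 days of running.)

Looking for guidance. Pradeep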

Rebuild your docker & containerd after upgrading the kernel.

This happened to me once after upgrading from 5.4.6 to 5.18.5. Rebuilding the docker & containerd packages solved it.
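To check whether a kernel upgrade lines up with the onset of the failures, one option is to record the kernel release at deployment time and compare it later. This is only a sketch: `kernel_version` and `kernel_changed` are hypothetical helper names, and the release strings are assumed to start with `major.minor[.patch]`.

```python
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int, int]:
    """Parse a release string such as '5.18.5-generic' into (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def kernel_changed(baseline: str, current: str) -> bool:
    """True when the numeric kernel version differs from the recorded baseline."""
    return kernel_version(baseline) != kernel_version(current)
```

On Linux, `platform.release()` from the standard library returns the running kernel's release string, so `kernel_changed(saved_release, platform.release())` can flag a host whose kernel has moved since the docker & containerd packages were built. Comparing parsed tuples rather than raw strings matters here, because as plain strings "5.18.5" sorts before "5.4.6".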

