
GKE container killed by 'Memory cgroup out of memory' but monitoring, local testing and pprof shows usage far below limit

I recently pushed a new container image to one of my GKE deployments and noticed that API latency went up and requests started returning 502s.

Looking at the logs I found that the container started crashing because of OOM:

Memory cgroup out of memory: Killed process 2774370 (main) total-vm:1801348kB, anon-rss:1043688kB, file-rss:12884kB, shmem-rss:0kB, UID:0 pgtables:2236kB oom_score_adj:980

Looking at the memory usage graph it didn't look like the pods were using more than 50MB of memory combined. My original resource requests were:

...
spec:
...
  template:
...
    spec:
...
      containers:
      - name: api-server
...
        resources:
          # You must specify requests for CPU to autoscale
          # based on CPU utilization
          requests:
            cpu: "150m"
            memory: "80Mi"
          limits:
            cpu: "1"
            memory: "1024Mi"
      - name: cloud-sql-proxy
        # It is recommended to use the latest version of the Cloud SQL proxy
        # Make sure to update on a regular schedule!
        image: gcr.io/cloudsql-docker/gce-proxy:1.17
        resources:
          # You must specify requests for CPU to autoscale
          # based on CPU utilization
          requests:
            cpu: "100m"
...

Then I tried bumping the request for the API server to 1GB but it did not help. In the end, what helped was reverting the container image to the previous version:

[GKE console graph: pod memory usage stays far below the 1GB limit]

Looking through the changes in the golang binary there are no obvious memory leaks. When I run it locally it uses at most 80MB of memory, even under load from the same requests as in production.
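For reference, the local measurement is roughly the following kind of check: a minimal sketch (not my actual server code) that periodically logs the Go runtime's own view of its memory, so it can be compared with the RSS/cgroup numbers:

// Minimal sketch: periodically log the Go runtime's own memory statistics
// so they can be compared with the RSS reported by the OS or the cgroup.
package main

import (
        "log"
        "runtime"
        "time"
)

func main() {
        for {
                var m runtime.MemStats
                runtime.ReadMemStats(&m)
                log.Printf("heapAlloc=%dMiB heapSys=%dMiB sys=%dMiB numGC=%d",
                        m.HeapAlloc>>20, m.HeapSys>>20, m.Sys>>20, m.NumGC)
                time.Sleep(10 * time.Second)
        }
}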

And the above graph which I got from the GKE console also shows the pod using far less than the 1GB memory limit.

So my question is: What could cause GKE to OOM-kill my process when both GKE monitoring and a local run show it using only 80MB of the 1GB limit?

=== EDIT ===

[GKE graph of the same outage, split by the two containers in the pod]

Adding another graph of the same outage, this time splitting the two containers in the pod. If I understand correctly, the metric here is the non-evictable container/memory/used_bytes:

container/memory/used_bytes (GA)
Display name: Memory usage
Kind, type, unit: GAUGE, INT64, By
Monitored resource: k8s_container
Description: Memory usage in bytes. Sampled every 60 seconds.
Label memory_type: Either `evictable` or `non-evictable`. Evictable memory is memory that can be easily reclaimed by the kernel, while non-evictable memory cannot.

Edit Apr 26 2021

I tried updating the resources field in the deployment yaml to 1GB RAM requested and a 1GB RAM limit, as suggested by Paul and Ryan:

        resources:
          # You must specify requests for CPU to autoscale
          # based on CPU utilization
          requests:
            cpu: "150m"
            memory: "1024Mi"
          limits:
            cpu: "1"
            memory: "1024Mi"

Unfortunately it had the same result after updating with kubectl apply -f api_server_deployment.yaml:

{
 insertId: "yyq7u3g2sy7f00"  
 jsonPayload: {
  apiVersion: "v1"   
  eventTime: null   
  involvedObject: {
   kind: "Node"    
   name: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"    
   uid: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"    
  }
  kind: "Event"   
  message: "Memory cgroup out of memory: Killed process 1707107 (main) total-vm:1801412kB, anon-rss:1043284kB, file-rss:9732kB, shmem-rss:0kB, UID:0 pgtables:2224kB oom_score_adj:741"   
  metadata: {
   creationTimestamp: "2021-04-26T23:13:13Z"    
   managedFields: [
    0: {
     apiVersion: "v1"      
     fieldsType: "FieldsV1"      
     fieldsV1: {
      f:count: {
      }
      f:firstTimestamp: {
      }
      f:involvedObject: {
       f:kind: {
       }
       f:name: {
       }
       f:uid: {
       }
      }
      f:lastTimestamp: {
      }
      f:message: {
      }
      f:reason: {
      }
      f:source: {
       f:component: {
       }
       f:host: {
       }
      }
      f:type: {
      }
     }
     manager: "node-problem-detector"      
     operation: "Update"      
     time: "2021-04-26T23:13:13Z"      
    }
   ]
   name: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy.16798b61e3b76ec7"    
   namespace: "default"    
   resourceVersion: "156359"    
   selfLink: "/api/v1/namespaces/default/events/gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy.16798b61e3b76ec7"    
   uid: "da2ad319-3f86-4ec7-8467-e7523c9eff1c"    
  }
  reason: "OOMKilling"   
  reportingComponent: ""   
  reportingInstance: ""   
  source: {
   component: "kernel-monitor"    
   host: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"    
  }
  type: "Warning"   
 }
 logName: "projects/questions-279902/logs/events"  
 receiveTimestamp: "2021-04-26T23:13:16.918764734Z"  
 resource: {
  labels: {
   cluster_name: "api-us-central-1"    
   location: "us-central1-a"    
   node_name: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"    
   project_id: "questions-279902"    
  }
  type: "k8s_node"   
 }
 severity: "WARNING"  
 timestamp: "2021-04-26T23:13:13Z"  
}

Kubernetes seems to have almost immediately killed the container for using 1GB of memory. But again, the metrics show the container using only 2MB of memory:

[GKE metrics graph: container memory usage around 2MB]

And again I am stumped because even under load this binary does not use more than 80MB when I run it locally.

I also tried running go tool pprof <url>/debug/pprof/heap. It showed several different values as Kubernetes kept thrashing the container, but none higher than ~20MB, and no memory usage out of the ordinary.
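For completeness, the /debug/pprof/heap endpoint comes from net/http/pprof. A minimal sketch of how it is typically wired up (an assumption about the setup, not the exact server code):

// Minimal sketch: the blank import of net/http/pprof registers the
// /debug/pprof/* handlers on http.DefaultServeMux; serving that mux
// exposes them for `go tool pprof`.
package main

import (
        "log"
        "net/http"
        _ "net/http/pprof"
)

func main() {
        // Serve the profiling endpoints on a separate local port.
        go func() {
                log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
        select {} // stand-in for the real API server's main loop
}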

Edit 04/27

I tried setting request=limit for both containers in the pod:

 requests:
   cpu: "1"
   memory: "1024Mi"
 limits:
   cpu: "1"
   memory: "1024Mi"
...
requests:
  cpu: "100m"
  memory: "200Mi"
limits:
  cpu: "100m"
  memory: "200Mi"

But it didn't work either:

Memory cgroup out of memory: Killed process 2662217 (main) total-vm:1800900kB, anon-rss:1042888kB, file-rss:10384kB, shmem-rss:0kB, UID:0 pgtables:2224kB oom_score_adj:-998

And the memory metrics still show usage in the single-digit MBs.

Update 04/30

I pinpointed the change that seemed to cause this issue by painstakingly checking out my latest commits one by one.

In the offending commit I had a couple of lines like:

type Pic struct {
        image.Image
        Proto *pb.Image
}
...

pic.Image = picture.Resize(pic, sz.Height, sz.Width)
...

Where picture.Resize eventually calls resize.Resize. I changed it to:

type Pic struct {
        Img   image.Image
        Proto *pb.Image
}
...
pic.Img = picture.Resize(pic.Img, sz.Height, sz.Width)

This solves my immediate problem and the container runs fine now. But it does not answer my original question:

  1. Why did these lines cause GKE to OOM my container?
  2. And why did the GKE memory metrics show that everything was fine?
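A side note on why the original line compiled at all (assuming picture.Resize takes an image.Image, as the fixed call suggests): because Pic embeds image.Image, a Pic value itself satisfies the image.Image interface, so the whole struct, Proto field and all, could be passed where a plain image was expected. A minimal self-contained sketch, with hypothetical names:

// Sketch: a struct that embeds image.Image satisfies image.Image itself,
// so it can be passed to any function that expects an image.
package main

import (
        "fmt"
        "image"
)

type Pic struct {
        image.Image // embedded interface: promotes Bounds, At, ColorModel
        // other fields (e.g. the Proto pointer) ride along with the value
}

// resizeLike stands in for a helper such as picture.Resize / resize.Resize.
func resizeLike(img image.Image) image.Image {
        return image.NewRGBA(img.Bounds())
}

func main() {
        pic := Pic{Image: image.NewRGBA(image.Rect(0, 0, 4000, 3000))}
        pic.Image = resizeLike(pic) // compiles: Pic is an image.Image
        fmt.Println(pic.Bounds())
}

With the fix, pic.Img is passed explicitly, so only the image value crosses the call boundary.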

I guess it was caused by the Pod QoS class.

When the system is overcommitted, the QoS classes determine which pod gets killed first so the freed resources can be given to higher-priority pods.

In your case, the QoS of your pod would be Burstable.

Each running process has an out-of-memory (OOM) score. The system selects which process to kill by comparing the OOM scores of all running processes; when memory needs to be freed, the process with the highest score gets killed. For details of how the score is calculated, refer to "How is kernel oom score calculated?".
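For Burstable pods, the kubelet derives each container's oom_score_adj from its memory request relative to the node's memory capacity, roughly oom_score_adj = 1000 - 1000 * request / capacity, clamped into the Burstable range. The following is a rough sketch of that arithmetic (my approximation of the kubelet QoS policy, not something from this thread); it is consistent with the oom_score_adj:980 in the first kill message (80Mi requested on an e2-highcpu-4 node with about 4GiB) and the 741 seen after the request was raised to 1024Mi:

// Rough approximation of the kubelet's Burstable oom_score_adj calculation;
// exact clamping details may differ between Kubernetes versions.
package main

import "fmt"

func burstableOOMScoreAdj(requestBytes, nodeCapacityBytes int64) int64 {
        adj := 1000 - (1000*requestBytes)/nodeCapacityBytes
        if adj < 3 {
                adj = 3 // stay above guaranteed/system processes
        }
        if adj > 999 {
                adj = 999
        }
        return adj
}

func main() {
        const mi int64 = 1 << 20
        // With a nominal 4GiB node; the node's reported capacity is a bit
        // below 4GiB, which nudges the results toward the observed 980 and 741.
        fmt.Println(burstableOOMScoreAdj(80*mi, 4096*mi))   // 981
        fmt.Println(burstableOOMScoreAdj(1024*mi, 4096*mi)) // 750
}

The larger the memory request, the lower the score adjustment, and the less likely the container is to be picked by the OOM killer.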

Which pod will be killed first if both are in the Burstable class?

In short, the system will kill the one that is using a larger percentage of its requested memory.

Pod A

used: 90Mi
requests: 100Mi
limits: 200Mi
Pod B

used: 150Mi
requests: 200Mi
limits: 400Mi

Pod A will get killed before Pod B because it uses 90% of its requested memory while Pod B uses only 75% of its requested memory.

The resource spec here is the root cause of the OOM.

In Kubernetes, requested and limited memory are defined differently. Requested memory is the memory the container is guaranteed to get. Limited memory is the memory the container is allowed to burst into, but the limit does not guarantee that the container can actually get those resources.

In most production systems, it is not recommended that the limit and the request differ too much. For example, in your case:

requests:
  cpu: "150m"
  memory: "80Mi"
limits:
  cpu: "1"
  memory: "1024Mi"

The container is only guaranteed 80Mi of memory, but it can burst up to 1024Mi. The node may not have enough memory for that burst, and the container itself will then be OOM-killed.

So, if you want to improve this situation, you need to configure the resources to be something like this:

requests:
  cpu: "150m"
  memory: "1024Mi"
limits:
  cpu: "1"
  memory: "1024Mi"

Please note that the CPU settings are fine as they are, because a process does not get killed for running low on CPU time; an OOM, however, does get the process killed.

As the answer above mentioned, this is related to the pod's quality of service. In general, for most end users, you should configure your containers as the Guaranteed class, i.e. requests == limits. You may need some justification before configuring them as the Burstable class.
