
prometheus monitor container memory

By monitoring the real memory used by the containers, I found that the total real memory of all containers is larger than the total memory of all physical nodes. This is very strange.

However, I found that some of the monitored metrics have no container_name field. Only when the series without a container_name field are removed does the actual memory of the containers look reasonable.

Why does this happen? (PS: container_name="POD" is already excluded.)


sum(sum(container_memory_rss{container_name!="POD",container_name=~"[a-z].*"}) by (container_name))/1024^4

sum(sum(container_memory_rss{container_name!="POD"}) by (container_name))/1024^4

Here is what we use for mapping container memory metrics:

sum by (container, pod, namespace, node, job)(container_memory_rss{container != "POD", image != "", container != ""})
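
For instance, summing that expression across all containers gives a cluster-wide total, and dividing by 1024^3 converts bytes to GiB (a sketch reusing the same label conventions as above):

    sum(container_memory_rss{container != "POD", image != "", container != ""}) / 1024^3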

To answer your specific question of why the value is higher: that's because it includes the node memory itself.

kubelet (cadvisor) reports memory metrics for multiple cgroups. For example, id="/" is the metric for the root cgroup (i.e. for the entire node).

e.g. in my setup the following series is the node memory:

{endpoint="https-metrics", id="/", instance="10.0.84.2:10250", job="kubelet", metrics_path="/metrics/cadvisor", node="ip-10-xx-x-x.us-west-2.compute.internal", service="kube-prometheus-stack-kubelet"}
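
As a sketch of how to keep only real per-container series (label names assumed to match the cadvisor output above), you can require a non-empty container label, which drops the root cgroup and other aggregate series such as id="/":

    # per-container RSS only; the id="/" node-level series has no container label and is excluded
    sum(container_memory_rss{container != "POD", container != ""}) / 1024^3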

Also, at www.asserts.ai we use the max of the rss, working-set, and usage metrics to arrive at the actual memory used by a container.

See below for a reference to our recording rules:

      
      #
      - record: asserts:container_memory
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)(container_memory_rss{container != "POD", image != "", container != ""})
        labels:
          source: rss

      - record: asserts:container_memory
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)(container_memory_working_set_bytes{container != "POD", image != "", container != ""})
        labels:
          source: working

      - record: asserts:container_memory
        # why sum? multiple copies of the same container may be running in the same pod
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)
          (
          container_memory_usage_bytes {container != "POD", image != "", container != ""} -
          container_memory_cache {container != "POD", image != "", container != ""} -
          container_memory_swap {container != "POD", image != "", container != ""}
          )
        labels:
          source: usage

      # For KPI Rollup Purposes
      - record: asserts:resource:usage
        expr: |-
          max without (source) (asserts:container_memory)
          * on (namespace, pod, asserts_env, asserts_site) group_left(workload) asserts:mixin_pod_workload
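
As a quick sanity check (a sketch; the label names are assumed to match the cadvisor defaults shown earlier), you can compare the node-level total with the per-container total and confirm they are now in the same ballpark:

    # memory usage of the root cgroup, i.e. the whole node, in GiB
    sum(container_memory_working_set_bytes{id="/"}) / 1024^3

    # memory usage summed over real containers only, in GiB
    sum(container_memory_working_set_bytes{container != "POD", container != ""}) / 1024^3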

