
prometheus monitor container memory

By monitoring the real memory used by the containers, I found that the total real memory of all containers is larger than the total memory of all physical nodes. This is very strange.

However, I found that some of the monitored metrics have no container_name field. Only when the series without a container_name field are removed does the actual memory of the containers look reasonable.

Why does this happen? (PS: container_name="POD" is already excluded.)


sum(sum(container_memory_rss{container_name!="POD",container_name=~"[a-z].*"}) by (container_name))/1024^4

sum(sum(container_memory_rss{container_name!="POD"}) by (container_name))/1024^4

Here is what we use for mapping container memory metrics:

sum by (container, pod, namespace, node, job)(container_memory_rss{container != "POD", image != "", container != ""})
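
For instance, summing that expression across all containers gives a cluster-wide total, and dividing by 1024^3 converts bytes to GiB (a sketch reusing the same label conventions as above):

    sum(container_memory_rss{container != "POD", image != "", container != ""}) / 1024^3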

To answer your specific question of why the value is higher: that's because it includes the node memory itself.

kubelet (cadvisor) reports memory metrics for multiple cgroups. For example, id="/" is the metric for the root cgroup (i.e. for the entire node).

e.g. in my setup the following series is the node memory:

{endpoint="https-metrics", id="/", instance="10.0.84.2:10250", job="kubelet", metrics_path="/metrics/cadvisor", node="ip-10-xx-x-x.us-west-2.compute.internal", service="kube-prometheus-stack-kubelet"}
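
As a sketch of how to keep only real per-container series (label names assumed to match the cadvisor output above), you can require a non-empty container label, which drops the root cgroup and other aggregate series such as id="/":

    # per-container RSS only; the id="/" node-level series has no container label and is excluded
    sum(container_memory_rss{container != "POD", container != ""}) / 1024^3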

Also, at www.asserts.ai we use the max of the rss, working-set, and usage metrics to arrive at the actual memory used by a container.

See below for a reference to our recording rules:

      
      #
      - record: asserts:container_memory
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)(container_memory_rss{container != "POD", image != "", container != ""})
        labels:
          source: rss

      - record: asserts:container_memory
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)(container_memory_working_set_bytes{container != "POD", image != "", container != ""})
        labels:
          source: working

      - record: asserts:container_memory
        # why sum? multiple copies of the same container may be running in the same pod
        expr: sum by (container, pod, namespace, node, job, asserts_env, asserts_site)
          (
          container_memory_usage_bytes {container != "POD", image != "", container != ""} -
          container_memory_cache {container != "POD", image != "", container != ""} -
          container_memory_swap {container != "POD", image != "", container != ""}
          )
        labels:
          source: usage

      # For KPI Rollup Purposes
      - record: asserts:resource:usage
        expr: |-
          max without (source) (asserts:container_memory)
          * on (namespace, pod, asserts_env, asserts_site) group_left(workload) asserts:mixin_pod_workload
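
As a quick sanity check (a sketch; the label names are assumed to match the cadvisor defaults shown earlier), you can compare the node-level total with the per-container total and confirm they are now in the same ballpark:

    # memory usage of the root cgroup, i.e. the whole node, in GiB
    sum(container_memory_working_set_bytes{id="/"}) / 1024^3

    # memory usage summed over real containers only, in GiB
    sum(container_memory_working_set_bytes{container != "POD", container != ""}) / 1024^3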

