Calculating k8s cluster CPU/memory usage with Prometheus

Question

I want to measure CPU and memory usage for a whole k8s cluster (not per-pod usage) with Prometheus, so that I can show it in Grafana.

I use sum (container_memory_usage_bytes{id="/"}) to get the memory used by the k8s cluster, and topk(1, sum(kube_node_status_capacity_memory_bytes) by (instance)) to get the total memory of the cluster, but I cannot divide one by the other because topk returns a vector rather than a single value.

How can I do this?

Answer 1

I installed Prometheus on Google Cloud through the gcloud default applications, and the dashboards were deployed automatically with the installation. They use the following queries for the memory and CPU usage of the cluster:

CPU usage by namespace:

sum(irate(container_cpu_usage_seconds_total[1m])) by (namespace)

Memory usage (no cache) by namespace:

sum(container_memory_rss) by (namespace)

CPU request commitment:

sum(kube_pod_container_resource_requests_cpu_cores) / sum(node:node_num_cpu:sum)

Memory request commitment:

sum(kube_pod_container_resource_requests_memory_bytes) / sum(node_memory_MemTotal)
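
These metric names depend on the exporter versions in use. As a rough sketch, assuming kube-state-metrics v2 or later (which exposes requests and node resources with a resource label) and node_exporter 0.16 or later (which added the _bytes suffix to memory metrics), the two commitment ratios could be written as below. node:node_num_cpu:sum looks like a recording rule (judging by its colon-separated name) that may not exist in every setup, so kube_node_status_allocatable is swapped in here as an assumption:

sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_allocatable{resource="cpu"})

sum(kube_pod_container_resource_requests{resource="memory"}) / sum(node_memory_MemTotal_bytes)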

Answer 2

The following query returns the total memory usage of all the running pods in K8S:

sum(container_memory_usage_bytes{container!=""})

This query uses the sum() aggregate function to add up memory usage across all the containers running in K8S.

The container!="" filter is needed to filter out redundant metrics related to the cgroups hierarchy. See this answer for details.

The following query returns the memory usage of the k8s cluster as a percentage:

100 * (
  sum(container_memory_usage_bytes{container!=""})
    /
  sum(kube_node_status_capacity{resource="memory"})
)

Note that some nodes in K8S can have a much higher memory usage percentage than others because of scheduling policies. The following query returns the top 3 nodes by memory usage percentage:

topk(3,
  100 * (
    sum(container_memory_usage_bytes{container!=""}) by (node)
      / on(node)
    kube_node_status_capacity{resource="memory"}
  )
)

This query uses the topk function to limit the number of returned time series to 3. Note that the query may return more than 3 time series on a graph in Grafana, since topk returns up to k unique time series for each point on the graph. If you need a graph with no more than k time series with the maximum values, take a look at the topk_* functions in MetricsQL, such as topk_max, topk_avg or topk_last.
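
As a hedged sketch of that suggestion, the query below applies topk_max to the same per-node expression. It only works on a backend that supports MetricsQL (such as VictoriaMetrics), not on plain Prometheus, and it selects the 3 series with the highest maximum over the graphed time range:

topk_max(3,
  100 * (
    sum(container_memory_usage_bytes{container!=""}) by (node)
      / on(node)
    kube_node_status_capacity{resource="memory"}
  )
)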

The query also uses the on() modifier for the / operation. This modifier limits the set of labels used for finding time series pairs with identical label values on the left and right sides of /. Prometheus then applies the / operation to each such pair individually. See these docs for details.
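
To see why on(node) matters here: after sum(...) by (node) the left side keeps only the node label, while the kube_node_status_capacity series carry additional labels (resource, plus whatever the scrape config attaches, such as instance and job). Without on(node) the label sets differ and the division returns nothing. One alternative, sketched on the assumption that each node has a single capacity series per resource, is to aggregate the right side by node as well, so that both sides have identical label sets and no modifier is needed:

100 * (
  sum(container_memory_usage_bytes{container!=""}) by (node)
    /
  sum(kube_node_status_capacity{resource="memory"}) by (node)
)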

The following query returns the number of CPU cores used by all the pods in Kubernetes:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))

The following query returns the CPU usage of the k8s cluster as a percentage:

100 * (
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    /
  sum(kube_node_status_capacity{resource="cpu"})
)

Some nodes may be loaded much more than the rest of the nodes in a Kubernetes cluster. The following query returns the top 3 nodes with the highest CPU load:

topk(3,
  100 * (
    sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
      / on(node)
    kube_node_status_capacity{resource="cpu"}
  )
)

Answer 3

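My main problem was that topk(1, sum(kube_node_status_capacity_memory_bytes) by (instance)) does not return a single value, but I have now found that wrapping it in sum() works. The whole query is as follows: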

sum(sum (container_memory_usage_bytes{id="/"})by (instance))/sum(topk(1, sum(kube_node_status_capacity_memory_bytes) by (instance)))*100
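
The original division most likely returned nothing because of label matching: sum (container_memory_usage_bytes{id="/"}) produces a series with no labels, while topk(1, ... by (instance)) produces a series that still carries the instance label, so the two sides never pair up. Wrapping the topk in sum() drops that label. Another option, sketched here on the assumption that the topk expression always yields exactly one series, is to convert it to a scalar with the PromQL scalar() function:

sum(container_memory_usage_bytes{id="/"}) / scalar(topk(1, sum(kube_node_status_capacity_memory_bytes) by (instance))) * 100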

3 answers

Answer 1
2 2019-02-25 16:17:05

Answer 2
1 2022-04-28 11:05:11

Answer 3
0 accepted 2019-02-26 02:28:21
