Count k8s cluster CPU/memory usage with Prometheus

Question

I want to measure CPU and memory usage for the whole k8s cluster (not per pod) with Prometheus, so that I can show it in Grafana.

I use sum (container_memory_usage_bytes{id="/"}) to get the memory used by the cluster, and topk(1, sum(kube_node_status_capacity_memory_bytes) by (instance)) to get the total cluster memory, but I cannot divide one by the other, since topk returns a vector rather than a single value.

How can I do this?

Answer 1

I installed Prometheus on Google Cloud through the gcloud default applications, and the dashboards were deployed automatically with the installation. The following queries are the ones those dashboards use for memory and CPU usage of the cluster:

CPU usage by namespace:

sum(irate(container_cpu_usage_seconds_total[1m])) by (namespace)

Memory usage (no cache) by namespace:

sum(container_memory_rss) by (namespace)

CPU request commitment:

sum(kube_pod_container_resource_requests_cpu_cores) / sum(node:node_num_cpu:sum)

Memory request commitment:

sum(kube_pod_container_resource_requests_memory_bytes) / sum(node_memory_MemTotal)

Answer 2

The following query returns global memory usage for all the running pods in K8S:

sum(container_memory_usage_bytes{container!=""})

This query uses the sum() aggregate function to add up memory usage across all the containers running in K8S.

The container!="" filter is needed to filter out redundant metrics related to the cgroups hierarchy. See this answer for details.
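To see why the filter matters, here is a minimal Python sketch with made-up numbers. It assumes the usual cAdvisor layout, where parent cgroups (pod level and the root "/") are exported as extra series with an empty container label and their values already include the containers below them. All label values and byte counts are hypothetical.

```python
# Hypothetical samples of container_memory_usage_bytes as (labels, value) pairs.
# Series with container="" stand for parent cgroups whose values already
# include the containers below them (an assumption about the exporter layout).
samples = [
    ({"pod": "", "container": ""}, 700),          # root cgroup "/"
    ({"pod": "a", "container": ""}, 300),         # pod-level cgroup for pod a
    ({"pod": "a", "container": "app"}, 200),
    ({"pod": "a", "container": "sidecar"}, 100),
    ({"pod": "b", "container": ""}, 400),         # pod-level cgroup for pod b
    ({"pod": "b", "container": "db"}, 400),
]

def promql_sum(series, keep=lambda labels: True):
    """Add up the values of all series whose labels pass the `keep` filter."""
    return sum(value for labels, value in series if keep(labels))

unfiltered = promql_sum(samples)
filtered = promql_sum(samples, keep=lambda labels: labels["container"] != "")

print(unfiltered)  # 2100: parent cgroups are counted on top of their children
print(filtered)    # 700: only real containers are counted
```

In this toy data the unfiltered sum is three times the real usage, because every byte is counted once per level of the hierarchy.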

The following query returns global memory usage for the k8s cluster as a percentage:

100 * (
  sum(container_memory_usage_bytes{container!=""})
    /
  sum(kube_node_status_capacity{resource="memory"})
)

Note that some nodes in K8S can have a much higher memory usage percentage than others because of scheduling policies. The following query returns the top 3 nodes with the highest memory usage percentage:

topk(3,
  100 * (
    sum(container_memory_usage_bytes{container!=""}) by (node)
      / on(node)
    kube_node_status_capacity{resource="memory"}
  )
)

This query uses the topk function to limit the number of returned time series to 3. Note that the query may return more than 3 time series on a graph in Grafana, since topk returns up to k unique time series for each point on the graph. If you need a graph with no more than k time series with the maximum values, then take a look at the topk_* functions in MetricsQL, such as topk_max, topk_avg or topk_last.
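The per-point behaviour of topk can be illustrated with a small Python sketch. It is not how Prometheus is implemented; it only mimics evaluating topk independently at each step of a range query. The node names and percentages are hypothetical.

```python
# Hypothetical memory usage (%) for four nodes at three evaluation steps.
usage_by_step = [
    {"node-1": 90, "node-2": 80, "node-3": 70, "node-4": 10},
    {"node-1": 90, "node-2": 80, "node-3": 20, "node-4": 75},
    {"node-1": 15, "node-2": 80, "node-3": 70, "node-4": 75},
]

def topk(k, values):
    """Return the k entries with the largest values at a single instant."""
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

# A range query evaluates topk separately at every step, so the set of
# series drawn on the graph is the union of the per-step winners.
series_on_graph = set()
for values in usage_by_step:
    series_on_graph.update(topk(3, values))

print(sorted(series_on_graph))
```

Each step returns exactly 3 series, yet all four nodes end up on the graph because the winners change from step to step.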

The query also uses the on() modifier for the / operation. This modifier limits the set of labels used to find pairs of time series on the left and right side of / with identical label values. Prometheus then applies the / operation to each such pair individually. See these docs for details.
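The pairing done by on() can be sketched in a few lines of Python. This is a simplified model of one-to-one matching, not Prometheus code: it assumes each side has at most one series per matching key, and the label values and numbers are hypothetical.

```python
def divide_on(lhs, rhs, on):
    """Sketch of `lhs / on(<labels>) rhs` for one-to-one matching.

    Each vector is a list of (labels, value) pairs. Two series are paired
    when they agree on the labels listed in `on`; other labels are ignored.
    A missing label is treated like an empty string.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    rhs_by_key = {key(labels): value for labels, value in rhs}
    result = {}
    for labels, value in lhs:
        k = key(labels)
        if k in rhs_by_key:  # series without a partner are dropped
            result[k] = value / rhs_by_key[k]
    return result

# Left side: output of sum(...) by (node), so only the node label remains.
used = [({"node": "n1"}, 4), ({"node": "n2"}, 1)]
# Right side: capacity series carry extra labels such as resource.
capacity = [
    ({"node": "n1", "resource": "memory"}, 8),
    ({"node": "n2", "resource": "memory"}, 4),
]

print(divide_on(used, capacity, on=["node"]))              # pairs found per node
print(divide_on(used, capacity, on=["node", "resource"]))  # no pairs: resource differs
```

Matching on node alone pairs every node with its capacity, while including the resource label leaves nothing to pair, which is why the query restricts matching to on(node).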

The following query returns the number of CPU cores used by all the pods in Kubernetes:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
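The reason this yields a number of cores is that container_cpu_usage_seconds_total is a counter of CPU-seconds, so its per-second rate is "CPU-seconds per second", that is, cores. A simplified Python sketch with hypothetical samples follows; the real rate() also handles counter resets and extrapolates to the window edges, which this sketch ignores.

```python
def simple_rate(samples):
    """Per-second increase between the first and last sample in a window.

    samples: list of (timestamp_seconds, counter_value) pairs, oldest first.
    Counter resets and extrapolation are deliberately not modelled.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Hypothetical counter samples for two containers over a 5-minute window.
container_a = [(0, 1000.0), (300, 1150.0)]  # 150 CPU-seconds in 300 s
container_b = [(0, 400.0), (300, 1000.0)]   # 600 CPU-seconds in 300 s

# sum(rate(...)) adds the per-container rates together.
cores_used = simple_rate(container_a) + simple_rate(container_b)
print(cores_used)  # 2.5
```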

The following query returns global CPU usage for the k8s cluster as a percentage:

100 * (
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    /
  sum(kube_node_status_capacity{resource="cpu"})
)

Some nodes may be loaded much more heavily than the rest of the nodes in a Kubernetes cluster. The following query returns the top 3 nodes with the highest CPU load:

topk(3,
  100 * (
    sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
      / on(node)
    kube_node_status_capacity{resource="cpu"}
  )
)

Answer 3

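My main problem was that topk(1, sum(kube_node_status_capacity_memory_bytes) by (instance)) does not return a single value, but I have now found that wrapping it in sum() works. The whole query is as follows: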

sum(sum(container_memory_usage_bytes{id="/"}) by (instance)) / sum(topk(1, sum(kube_node_status_capacity_memory_bytes) by (instance))) * 100
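A likely explanation for why the sum() wrapper helps: without on() or ignoring(), a binary operator only pairs series whose label sets are identical. topk keeps the instance label, while sum() without "by" produces a single series with no labels at all. The Python sketch below models just that matching rule; it is an illustration, not Prometheus code, and the label values and numbers are hypothetical.

```python
def divide(lhs, rhs):
    """Sketch of `lhs / rhs` with default matching: label sets must be identical."""
    rhs_by_labels = {frozenset(labels.items()): value for labels, value in rhs}
    result = []
    for labels, value in lhs:
        key = frozenset(labels.items())
        if key in rhs_by_labels:
            result.append((labels, value / rhs_by_labels[key]))
    return result

def sum_all(vector):
    """Sketch of sum(...) without `by`: one series with an empty label set."""
    return [({}, sum(value for _, value in vector))]

# Used memory, already aggregated with sum(), so it has no labels.
used = sum_all([({"instance": "node-1"}, 6), ({"instance": "node-2"}, 2)])
# Result of topk(1, sum(...) by (instance)): the instance label survives.
capacity_topk = [({"instance": "kube-state-metrics:8080"}, 16)]

print(divide(used, capacity_topk))           # [] -> shows up as "no data"
print(divide(used, sum_all(capacity_topk)))  # [({}, 0.5)]
```

Under this model the original division fails because of the leftover instance label, not because topk returns a different kind of value; wrapping topk in sum() strips that label so both sides match.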

3 answers

solution1
2 2019-02-25 16:17:05

solution2
1 2022-04-28 11:05:11

solution3
0 ACCEPTED 2019-02-26 02:28:21