
Does Kubernetes consider the current memory usage when scheduling pods

The Kubernetes docs on https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/ state:

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.

Does Kubernetes consider the current state of the node when calculating capacity? To highlight what I mean, here is a concrete example:

Assuming I have a node with 10Gi of RAM, running 10 Pods each with 500Mi of resource requests, and no limits. Let's say they are "bursting", and each Pod is actually using 1Gi of RAM. In this case, the node is fully utilized (10 x 1Gi = 10Gi), but the resource requests are only 10 x 500Mi = 5Gi. Would Kubernetes consider scheduling another pod on this node because only 50% of the memory capacity on the node has been requested, or would it use the fact that 100% of the memory is currently being utilized and the node is at full capacity?

By default kubernetes will use cgroups to manage and monitor the "allocatable" memory on a node for pods. It is possible to configure the kubelet to rely entirely on the static reservations and pod requests from your deployments though, so the method depends on your cluster deployment.

In either case, a node itself will track "memory pressure", which monitors the existing overall memory usage of a node. If a node is under memory pressure then no new pods will be scheduled and existing pods will be evicted.

It's best to set sensible memory requests and limits for all workloads to help the scheduler as much as possible. If a kubernetes deployment does not configure cgroup memory monitoring, setting requests is a requirement for all workloads. If the deployment is using cgroup memory monitoring, setting requests at least gives the scheduler extra detail on whether the pods to be scheduled will fit on a node (see the sketch below).
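
As a minimal sketch (the pod name and image are placeholders, and the values are illustrative), a container declaring both a memory request and a memory limit could be created like this:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: request-limit-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: 500Mi
      limits:
        memory: 1Gi
EOF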

Capacity and Allocatable Resources

The Kubernetes Reserve Compute Resources docco has a good overview of how memory is viewed on a node.

      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------

The default scheduler checks that a node isn't under memory pressure, then looks at the allocatable memory available on the node and whether the new pod's requests will fit in it.

The allocatable memory available is total-available-memory - kube-reserved - system-reserved - eviction-threshold - scheduled-pods.
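
A rough worked example of that arithmetic, using made-up numbers (all values in Mi):

total_memory=10240              # node capacity
kube_reserved=100
system_reserved=100
eviction_threshold=100
scheduled_pod_requests=5000     # sum of requests of pods already on the node
allocatable=$((total_memory - kube_reserved - system_reserved - eviction_threshold))
remaining=$((allocatable - scheduled_pod_requests))
echo "allocatable=${allocatable}Mi remaining_for_new_pods=${remaining}Mi"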

Scheduled Pods

The value for scheduled-pods can be calculated via a dynamic cgroup, or statically via the pods' resource requests.

The kubelet --cgroups-per-qos option, which defaults to true, enables cgroup tracking of scheduled pods. The pods kubernetes runs will be placed in a dedicated cgroup hierarchy so their actual usage can be monitored.

If --cgroups-per-qos=false then the allocatable memory will only be reduced by the resource requests of the pods scheduled on the node.
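
On a node using cgroup v1 with the cgroupfs driver, you can inspect that pod cgroup hierarchy directly; the path below is an assumption and varies by OS and cgroup driver (for example, kubepods.slice with the systemd driver):

# Sketch: overall memory limit and current usage of the kubepods cgroup (cgroup v1 paths)
cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/kubepods/memory.usage_in_bytes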

Eviction Threshold

The eviction-threshold is the level of free memory at which Kubernetes starts evicting pods. This defaults to 100MB but can be set via the kubelet command line. This setting is tied to both the allocatable value for a node and also the memory pressure state of a node in the next section.

System Reserved

The kubelet's system-reserved value can be configured as a static value (--system-reserved=) or monitored dynamically via a cgroup (--system-reserved-cgroup=). This is for any system daemons running outside of kubernetes (sshd, systemd etc). If you configure a cgroup, the processes all need to be placed in that cgroup.

Kube Reserved

The kubelet's kube-reserved value can be configured as a static value (via --kube-reserved=) or monitored dynamically via a cgroup (--kube-reserved-cgroup=). This is for any kubernetes services running outside of pods, usually the kubelet and a container runtime.
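
As an illustrative sketch (the values are made up, and real deployments usually set these via the kubelet config file or systemd unit rather than by hand), the static reservations and eviction threshold are typically set together as kubelet flags:

kubelet \
  --kube-reserved=cpu=100m,memory=256Mi \
  --system-reserved=cpu=100m,memory=256Mi \
  --eviction-hard='memory.available<200Mi'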

Capacity and Availability on a Node

Capacity is stored in the Node object.

$ kubectl get node node01 -o json | jq '.status.capacity'
{
  "cpu": "2",
  "ephemeral-storage": "61252420Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "4042284Ki",
  "pods": "110"
}

The allocatable value can be found on the Node; note that existing usage doesn't change this value. Only scheduling pods with resource requests will take away from the allocatable value.

$ kubectl get node node01 -o json | jq '.status.allocatable'
{
  "cpu": "2",
  "ephemeral-storage": "56450230179",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "3939884Ki",
  "pods": "110"
}
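
To see how much of that allocatable value has actually been requested by the pods on the node, kubectl describe node prints an "Allocated resources" summary of the summed requests and limits:

$ kubectl describe node node01 | grep -A 8 'Allocated resources'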

Memory Usage and Pressure

A kube node can also have a "memory pressure" event. This check is done outside of the allocatable resource checks above and is more of a system level catch-all. Memory pressure looks at the current root cgroup memory usage minus the inactive file cache/buffers, similar to the calculation free does to remove the file cache.

A node under memory pressure will not have pods scheduled on it, and will actively try to evict existing pods until the memory pressure state is resolved.
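
The memory pressure state is reported as a node condition, so you can check it directly:

$ kubectl get node node01 -o json | jq '.status.conditions[] | select(.type=="MemoryPressure")'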

You can set the eviction threshold amount of memory the kubelet will maintain available via the --eviction-hard=[memory.available<500Mi] flag. The memory requests and usage for pods can help inform the eviction process.

kubectl top node will give you the existing memory stats for each node (if you have a metrics service running).

$ kubectl top node
NAME                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
node01               141m         7%     865Mi           22%       

If you were not using cgroups-per-qos and had a number of pods without resource limits, or a number of system daemons, then the cluster is likely to have some problems scheduling on a memory constrained system, as allocatable will be high but the actual value might be really low.

Memory Pressure Calculation

The Kubernetes Out Of Resource Handling docco includes a script which emulates the kubelet's memory monitoring process:

# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"

Definitely YES, Kubernetes considers memory usage during the pod scheduling process.

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.

There are two key concepts in scheduling. First, the scheduler attempts to filter the nodes that are capable of running a given pod based on resource requests and other scheduling requirements. Second, the scheduler weighs the eligible nodes based on absolute and relative resource utilization of the nodes and other factors. The highest weighted eligible node is selected for scheduling of the pod. A good explanation of scheduling in Kubernetes can be found here: kubernetes-scheduling.

Simple example: your pod normally uses 100 Mi of RAM but you run it with a 50 Mi request. If you have a node with 75 Mi free the scheduler may choose to run the pod there. When pod memory consumption later expands to 100 Mi this puts the node under pressure, at which point the kernel may choose to kill your process. So it is important to get both memory requests and memory limits right (see the command below). About memory usage, requests and limits you can read more here: memory-resource.
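
In this example, setting the request closer to what the pod actually uses (and adding a limit) avoids that situation; for an existing workload this could be done in place, roughly like this (the deployment name is a placeholder):

$ kubectl set resources deployment my-app --requests=memory=100Mi --limits=memory=200Mi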

A container can exceed its memory request if the node has memory available. But a container is not allowed to use more than its memory limit. If a container allocates more memory than its limit, the container becomes a candidate for termination. If the container continues to consume memory beyond its limit, the container is terminated. If a terminated container can be restarted, the kubelet restarts it, as with any other type of runtime failure.
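
If a container was killed for exceeding its limit, its last state will report the reason OOMKilled, which you can check with kubectl (the pod name here is a placeholder):

$ kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'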

I hope this helps.

Yes, Kubernetes will consider current memory usage when scheduling Pods (not just requests), so your new Pod wouldn't get scheduled on the full node. Of course, there are also a number of other factors.

(FWIW, when it comes to resources, a request helps the scheduler by declaring a baseline value, and a limit kills the Pod when resources exceed that value, which helps with capacity planning/estimation.)
