
POD affinity rule to schedule pods across all nodes

We are running 6 nodes in a K8s cluster. Out of the 6, 2 of them run RabbitMQ, Redis & Prometheus; we have used a node selector & cordoned those nodes so no other pods get scheduled on them.
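For context, the node-selector part of that setup looks roughly like the following fragment in the infra pod templates (the dedicated=infra label here is hypothetical, not our actual label):

# Hypothetical fragment of the RabbitMQ/Redis/Prometheus pod spec; assumes
# the two infra nodes were labeled beforehand, e.g.:
#   kubectl label nodes <node-name> dedicated=infra
spec:
  nodeSelector:
    dedicated: infra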

The application pods run on the remaining 4 nodes; we have around 18-19 microservices. For GKE there is one open issue in the K8s official repo regarding automatic scale-down: https://github.com/kubernetes/kubernetes/issues/69696#issuecomment-651741837 , however people are suggesting the approach of setting a PDB, which we tested on Dev/Staging.

What we are looking for now is to pin pods to a particular node pool which does not scale, as we are running single replicas of some services.

As of now, we are thinking of applying affinity to those services which run with a single replica and have no scaling requirement.

For scalable services we won't specify any type of rule, so by default the K8s scheduler will schedule pods across different nodes. This way, if any node scales down, we don't face any downtime for a service running a single replica.

Affinity example:

affinity:
  nodeAffinity:
    # preferred rules take a weighted list of preference terms
    # (nodeSelectorTerms only applies to the required variant)
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'

We are planning to use the affinity type preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution.
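For comparison, the required variant of the same rule uses nodeSelectorTerms instead of weighted preference terms, and would prevent scheduling entirely on any node without the label (same hypothetical do-not-scale label as above):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'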

Note: Here K8s does not first create a new replica on another node during a node drain (scale-down of any node), as we are running single replicas with a rolling update & minAvailable: 25% strategy.

Why: If PodDisruptionBudget is not specified and we have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.

To make sure the application stays available during the node draining process, we have to specify a PodDisruptionBudget and create more replicas. If we have 1 pod with minAvailable: 30%, it will refuse to drain the node (scale-down).

Please point out any mistake if you see anything wrong & suggest a better option.

First of all, defining a PodDisruptionBudget doesn't make much sense when having only one replica. minAvailable expressed as a percentage is rounded up to an integer, as it represents the minimum number of Pods which need to be available all the time.

Keep in mind that you have no guarantee of High Availability when launching only one-replica Deployments.

Why: If PodDisruptionBudget is not specified and we have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.

If you didn't explicitly define the value of maxUnavailable in your Deployment's spec, by default it is set to 25%, which, rounded up to an integer (representing the number of Pods / replicas), equals 1. It means that 1 out of 1 replicas is allowed to be unavailable.

If we have 1 pod with minAvailable: 30%, it will refuse to drain the node (scale-down).

A single replica with minAvailable: 30% is rounded up to 1 anyway. 1/1 should still be up and running, so the Pod cannot be evicted and the node cannot be drained in this case.
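As a sketch, the PodDisruptionBudget discussed here might look as follows (the name and the app: example selector are assumptions; the selector has to match your Deployment's Pod labels, and on clusters older than v1.21 the API version is policy/v1beta1):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example
spec:
  minAvailable: 30%   # rounded up: with a single replica, 1 Pod must always stay up
  selector:
    matchLabels:
      app: example    # hypothetical label; must match the Deployment's pod template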

You can try the following solution, however I'm not 100% sure if it will work when your Pod is re-scheduled to another node due to its eviction from the one it is currently running on.

But if you re-create your Pod, e.g. because you update its image to a new version, you can guarantee that at least one replica stays up and running (the old Pod won't be deleted unless the new one enters the Ready state) by setting maxUnavailable: 0. As per the docs, by default it is set to 25%, which is rounded up to 1. So by default you allow one of your replicas (which in your case happens to be 1/1) to become unavailable during the rolling update. If you set it to zero, the old Pod won't be deleted unless the new one becomes Ready. At the same time, maxSurge: 2 allows 2 replicas to temporarily exist at the same time during the update.

Your Deployment definition may begin as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # keep the old Pod until the new one is Ready (default: 25%)
      maxSurge: 2        # allow up to 2 Pods to exist temporarily during the update
  selector:
  ...

Compare it with this answer, provided by mdaniel, where I originally found it.
