
Upgrading to a bigger node-pool in GKE

I have a node pool (default-pool) in a GKE cluster with 3 nodes, machine type n1-standard-1. They host 6 pods running a Redis cluster (3 masters and 3 slaves) and 3 pods running a Node.js example app.

I want to upgrade to a bigger machine type (n1-standard-2), also with 3 nodes.

In the documentation, Google gives an example of upgrading to a different machine type (by migrating to a new node pool).
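For reference, creating the replacement pool looks roughly like this (a sketch; CLUSTER_NAME is a placeholder for your own cluster name, and new-default-pool for the new pool):

# create a new pool with the larger machine type alongside the old one
gcloud container node-pools create new-default-pool \
  --cluster=CLUSTER_NAME \
  --machine-type=n1-standard-2 \
  --num-nodes=3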

I tested this in development, and my node pool was unreachable for a while when executing the following command:

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl cordon "$node";
done

In my terminal, I got a message that my connection to the server was lost (I could not execute kubectl commands). After a few minutes, I could reconnect and I got the desired output as shown in the documentation.

The second time, I tried leaving out the cordon command and skipped straight to the following command:

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o=name); do
  kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node";
done

This is because, if I interpret the Kubernetes documentation correctly, nodes are automatically cordoned when using the drain command. But I got the same result as with the cordon command: I lost connection to the cluster for a few minutes, and I could not reach the Node.js example app that was hosted on the same nodes. After a few minutes, it recovered on its own.

I found a workaround to upgrade to a new node pool with bigger machine types: I edited the Deployment/StatefulSet YAML files and changed the nodeSelector. Node pools in GKE are labeled with:

cloud.google.com/gke-nodepool=NODE_POOL_NAME

so I added the correct nodeSelector to the deployment.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
  labels:
    app: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: new-default-pool
      containers:
      - name: example
        image: IMAGE
        ports:
        - containerPort: 3000

This works without downtime, but I'm not sure it is the right way to do it in a production environment.
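The rollout triggered by the nodeSelector change can be followed with standard kubectl commands, for example (a sketch, using the deployment name from the example above):

kubectl apply -f deployment.yaml
kubectl rollout status deployment/example-deployment
# the NODE column should now show nodes from new-default-pool
kubectl get pods -l app=example -o wide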

What is wrong with the cordon/drain commands, or am I not using them correctly?

Cordoning a node will cause it to be removed from the load balancer's backend list, and so will a drain. The correct way to do it is to set up anti-affinity rules on the deployment so the pods are not deployed on the same node, or the same region for that matter. That will cause an even distribution of pods throughout your node pool.
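A minimal sketch of such an anti-affinity rule, added under the pod template spec of the example deployment above (assuming the same app: example label, spreading pods across nodes via the kubernetes.io/hostname topology key):

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: example
              topologyKey: kubernetes.io/hostname

Using the preferred (soft) form keeps the pods schedulable even if there are temporarily more replicas than nodes during the migration.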

Then you have to disable autoscaling on the old node pool if you have it enabled, slowly drain 1-2 nodes at a time and wait for the pods to appear on the new node pool, making sure at all times to keep at least one pod of the deployment alive so it can handle traffic.
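In command form, that procedure looks roughly like this (a sketch; CLUSTER_NAME and NODE_NAME are placeholders, and the drain is repeated node by node rather than looping over the whole pool at once):

# turn off autoscaling on the old pool, if it was enabled
gcloud container clusters update CLUSTER_NAME \
  --no-enable-autoscaling --node-pool default-pool

# drain one node and wait for its pods to be rescheduled on the new pool
kubectl drain --ignore-daemonsets --delete-local-data NODE_NAME
kubectl get pods -o wide

# once the old pool is empty, delete it
gcloud container node-pools delete default-pool --cluster CLUSTER_NAME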
