
How to scale up a Kubernetes cluster with Terraform while avoiding downtime?

Original post: 2022-02-09 08:23:22 | Tags: azure, kubernetes, terraform, azure-aks

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one node pool with 3 nodes, which are fairly low on resources because we don't yet have many simultaneous active users or requests.

Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking of more memory, or even replacing the nodes' VMs with better ones).

We set up everything Kubernetes-related using Terraform, and I know that replacing the VMs in a node pool is a destructive action, meaning the cluster will have to be replaced and the new config, all deployments, services, etc. will have to be reapplied.

I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running, but I would like to learn the best practices when it comes to scaling and performance. How can I perform such an increase in resources without any downtime of our services?

I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).

If there's any link, course, or tutorial you can point me to, it would be highly appreciated.

1 Answer

(Moved from comments)

In Azure, when you perform a cluster upgrade, there's a parameter called "max surge count", which is equal to 1 by default. It means that when you update your cluster or node configuration, it will first create one extra node with the updated configuration, and only then will it safely drain and remove one of the old ones. More on this here: Azure - Node Surge Upgrade
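As a rough illustration of where this setting lives in Terraform, here is a minimal sketch assuming the azurerm provider. All resource names, the location, and the VM size are hypothetical placeholders, not values from the question. The `upgrade_settings` block with `max_surge` is how the provider exposes the surge count on a node pool. Whether a given change (a VM size change in particular) goes through the surge path or forces the pool to be recreated depends on the provider and AKS versions, so it is worth checking the `terraform plan` output before applying.

```hcl
# Sketch only: names, location, and vm_size are hypothetical placeholders.
resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "westeurope"
}

resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "exampleaks"

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D2s_v3"

    # Number (or percentage) of extra nodes created during an upgrade
    # before old nodes are drained and removed. "1" matches the default
    # described in the answer above.
    upgrade_settings {
      max_surge = "1"
    }
  }

  identity {
    type = "SystemAssigned"
  }
}
```

Raising `max_surge` (for example to "33%") should make upgrades finish faster at the cost of temporarily running more nodes; leaving it at "1" keeps the extra capacity, and cost, to a single node at a time.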