
Use terraform to create identical parallel infrastructure for failover purposes

I have a requirement to use Terraform to provision identical copies of the same infrastructure in different places for failover purposes. For example, I have two Kubernetes clusters, A and B, and I want to be able to use Terraform to provision both of them to an identical state. It would be as if there were one Terraform plan, and two parallel applies to different "destinations" happened for each change.

Using provider aliases comes to mind, but that would require duplicating code for everything. Workspaces aren't a good fit either, because each set of infrastructure is a first-class citizen that should be kept in sync with the other.

The best I've come up with is to use a partial configuration for the backend (https://www.terraform.io/language/settings/backends/configuration#partial-configuration) and to use variables in the provider block, like so:

provider "kubernetes" {
  # "cluster" is a placeholder here; the real kubernetes provider selects the
  # target cluster with arguments such as host or config_context_cluster.
  cluster = var.foo
}

And run Terraform using:

terraform init -backend-config="baz=bat"
terraform plan -var "foo=bar"

Using this approach, there's a separate backend state for each copy of the infrastructure, and the provider is pointed at the right destination via command-line variables.

The above would work, but it requires a separate init, plan, and apply for each distinct copy of the infrastructure being provisioned. Is that the best I can hope for, or is there a better approach that combines everything into one workflow?
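For concreteness, the per-copy workflow above can be sketched as a small wrapper script. This is only a sketch: the environment names, the backend `key`, and the `foo` variable are illustrative, and `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/usr/bin/env sh
# Run the separate init/plan/apply sequence once per infrastructure copy.
# Set DRY_RUN=1 to echo the commands instead of executing terraform.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

apply_all() {
  for env in primary backup; do
    run terraform init -reconfigure -backend-config="key=state-$env" || return 1
    run terraform plan -var "foo=$env" -out="$env.tfplan" || return 1
    run terraform apply "$env.tfplan" || return 1
  done
}
```

Calling `DRY_RUN=1 apply_all` shows the command sequence; a real run would need Terraform installed and valid backend settings for each copy.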

EDIT: Adding more context based on a comment below. The scenario is that cluster A is in a less expensive, less reliable datacenter, and cluster B is in a more expensive, more reliable datacenter. To save costs, we want to run primarily in the less expensive datacenter, but have fully provisioned infrastructure ready to go if there is an outage in the primary datacenter. We'd keep cluster B artificially small (to achieve the cost savings) until we lose cluster A, at which point we'd scale out cluster B to handle the full workload.

The situation you are describing sounds like a variation on the typical idea of "environments", where you have two independent production environments rather than, e.g., separate staging and production stages.

The good news is that you can therefore employ mostly the same strategy that's typical for multiple deployment stages: factor out your common infrastructure into a shared module and write two different configurations that refer to it with some different settings.

Each of your configurations will presumably consist of just a backend configuration, a provider configuration, and a call to the shared module, like this:

terraform {
  backend "example" {
    # ...
  }

  required_providers {
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
  }
}

provider "kubernetes" {
  cluster = "whichever-cluster-is-appropriate-here"
}

module "main" {
  source = "../modules/main"

  # (whatever settings make sense for this environment)
}

This structure keeps all of the per-environment settings together in a single configuration, so you can just switch into that environment's directory and run the normal Terraform commands (with no unusual extra options) to update that particular environment.
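As a sketch of that workflow, assuming a hypothetical layout with one root configuration per environment (say `environments/primary` and `environments/backup`, each like the example above and both calling the shared `modules/main` module):

```shell
#!/usr/bin/env sh
# Update one environment by running the normal Terraform commands from its
# own directory. The directory layout is an assumption, not a Terraform rule.
update_env() {
  ( cd "environments/$1" && terraform init && terraform apply )  # subshell preserves cwd
}
```

Then `update_env backup` updates only the backup environment, with no extra flags needed.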


From your description it seems like a key requirement here is that each of your environments is a separate failure domain, which is one of the typical reasons to split infrastructure into two separate configurations. Doing so helps ensure that an outage of the underlying platform in one environment cannot prevent you from using Terraform to manage the other environment.

If you intend to build automation around your Terraform runs (which I'd recommend), I'd suggest configuring it so that any change to the shared module automatically triggers a run for both of your environments. That way both environments are routinely kept up to date, and you won't end up in the awkward situation of trying to fail over only to find that the backup environment is "stale" and needs significant updates before you can fail over into it.
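The triggering rule can be sketched as a small function that maps a list of changed paths (e.g. from `git diff --name-only`) to the environments that need a run. The `environments/` and `modules/` layout here is an assumption, not something prescribed by Terraform itself.

```shell
#!/usr/bin/env sh
# Decide which environments need a Terraform run, given changed file paths.
plan_targets() {
  run_primary=false; run_backup=false
  for path in "$@"; do
    case "$path" in
      modules/*)              run_primary=true; run_backup=true ;;  # shared code: run both
      environments/primary/*) run_primary=true ;;
      environments/backup/*)  run_backup=true ;;
    esac
  done
  out=""
  if [ "$run_primary" = true ]; then out="primary"; fi
  if [ "$run_backup" = true ]; then out="$out backup"; fi
  echo "${out# }"  # strip the leading space when only backup runs
}
```

For example, a commit touching only `environments/backup/` triggers a run just for the backup environment, while any change under `modules/` triggers both.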

Of course, you'd need to make sure that a failure of one of those runs cannot block applying the other one; otherwise you will have combined the failure domains and could prevent yourself from failing over in the event of an outage. The way I would imagine it working (in principle) is that, if there is an outage:

  1. You change the configuration of the backup environment to increase its scale.
  2. That triggers a run only for the backup environment, because the shared module hasn't changed. You apply that run to scale up the backup environment.
  3. You change some setting outside the scope of both of these environments to redirect incoming requests to the backup environment until the outage is over.
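The requirement that one run's failure must not block the other can be sketched like this. `terraform_apply_env` is a hypothetical hook standing in for the real per-environment commands (change into the directory, init, apply):

```shell
#!/usr/bin/env sh
# Apply each environment independently, so a failed run in one environment
# cannot block the other; report failures but keep going.
apply_each_independently() {
  failed=""
  for env in "$@"; do
    if terraform_apply_env "$env"; then
      echo "apply succeeded: $env"
    else
      failed="$failed $env"
      echo "apply FAILED: $env (continuing with the others)" >&2
    fi
  done
  [ -z "$failed" ]  # overall exit status: non-zero if anything failed
}
```

During an outage of the primary datacenter, the primary run simply fails while the backup run still applies, which is exactly the isolation described above.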

In the event that you do end up needing to change the shared module during an outage, the flow is similar, except that step 2 triggers a run for each of the environments and the primary environment's run fails; you can ignore that for now and just apply the backup environment's changes. Once the outage is over, you can re-run the primary environment's run to "catch up" with the changes made in the backup environment before you flip back to the primary environment, and then scale the backup environment back down.

The key theme here is that Terraform is a building block of a solution, not the entire solution itself: Terraform can help you make the changes you need to make, but you will need to build your own workflow (automated or not) around Terraform to make sure it runs in the appropriate context at the appropriate time to respond to an outage.

