
Is this zero-downtime database migration plan viable?

Original post: 2022-07-19 11:46:08 | Tags: sql, migration, rdbms, downtime, rolling-deployment

I am considering a zero-downtime database migration and have worked out what I believe are the minimum necessary steps.

By "migration" I mean any change within the same database that is not backward-compatible, such as renaming, splitting, or dropping a column.

Since I couldn't find much information elsewhere, I would like to validate my reasoning with someone who has hands-on experience with this. Let's assume we are able to perform rolling deployments; otherwise I don't believe a zero-downtime DB migration is possible. So:

Initial state: V1 is deployed in prod. It uses table1.oldColumn
Goal: rename table1.oldColumn to table1.newColumn with zero downtime

Steps:

1. Create table1.newColumn: ALTER TABLE table1 ADD COLUMN newColumn(...)
2. Gradually deploy V2. The V2 code contains the following changes:
   - SELECTs use oldColumn: SELECT oldColumn FROM table1 WHERE userId = 1001. That's because, for now, only oldColumn contains the full data, while newColumn contains only a subset of it
   - UPDATEs write to both columns; if a row has no value in newColumn yet, it is copied from oldColumn. If we don't do that, we will be chasing a constantly changing oldColumn forever
   - INSERTs use both columns: INSERT INTO table1 (oldColumn, newColumn) VALUES ('abcd', 'abcd')
   - DELETEs are usually irrelevant because a delete removes the entire row: DELETE FROM table1 WHERE userId = 1001
     - However, if the column is a UNIQUE KEY, then oldColumn is used: DELETE FROM table1 WHERE oldColumn = 'xyz'
3. All newly written data is now kept in sync, but older rows still differ between oldColumn and newColumn. To eliminate that difference, we run a background script that copies the values missing in newColumn from oldColumn
4. Now that the columns are in sync, gradually deploy V3. The V3 code contains the following changes: SELECTs, UPDATEs, INSERTs and DELETEs all use newColumn now. table1.oldColumn is not used anymore
5. Drop the unused table1.oldColumn: ALTER TABLE table1 DROP COLUMN oldColumn
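
Steps 1 to 3 can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard sqlite3 module rather than any particular production RDBMS, the function names (add_new_column, v2_insert, v2_update, v2_select, backfill) and the TEXT column type are hypothetical, and v2_update only covers the case where the renamed column itself is being written. The backfill compares the columns with SQLite's NULL-safe `IS NOT` instead of only looking for missing values, which is a deliberate widening of step 3.

```python
import sqlite3


def add_new_column(conn: sqlite3.Connection) -> None:
    # Step 1: purely additive change; V1 never names newColumn, so it keeps working.
    # TEXT is an assumed type; the original post leaves the definition elided.
    conn.execute("ALTER TABLE table1 ADD COLUMN newColumn TEXT")


def v2_insert(conn: sqlite3.Connection, user_id: int, value: str) -> None:
    # V2 INSERT: write the same value to both columns.
    conn.execute(
        "INSERT INTO table1 (userId, oldColumn, newColumn) VALUES (?, ?, ?)",
        (user_id, value, value),
    )


def v2_update(conn: sqlite3.Connection, user_id: int, value: str) -> None:
    # V2 UPDATE: mirror the write so the row is in sync afterwards.
    conn.execute(
        "UPDATE table1 SET oldColumn = ?, newColumn = ? WHERE userId = ?",
        (value, value, user_id),
    )


def v2_select(conn: sqlite3.Connection, user_id: int):
    # V2 SELECT: still read oldColumn, the only column with complete data.
    row = conn.execute(
        "SELECT oldColumn FROM table1 WHERE userId = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def backfill(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Step 3: copy oldColumn into newColumn in small batches.

    Returns the number of rows that were brought in sync.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE table1 SET newColumn = oldColumn
            WHERE rowid IN (
                SELECT rowid FROM table1
                WHERE newColumn IS NOT oldColumn  -- NULL-safe inequality in SQLite
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Working in small committed batches keeps each transaction short, so the script does not hold locks on the whole table while V2 is serving traffic. Comparing the columns rather than checking for NULL also repairs rows where a remaining V1 instance updated oldColumn after V2 had already written newColumn.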

Note: steps 3 and 5 can be performed as part of the database migration during V2 and V3 startup

Recap:

Initially newColumn is empty and all data goes to oldColumn
While we gradually replace V1 with V2, data starts to flow into newColumn alongside oldColumn. At this point some data still flows into oldColumn only (because we are performing a rolling update, so not all instances are V2 yet)
As soon as V2 is fully deployed, data flows into both oldColumn and newColumn. We mirror updates and inserts to keep the columns in sync
However, some data was inserted into oldColumn before newColumn was created, and some data got there from the remaining V1 instances that existed during the rolling update. We must get rid of this difference
When the script is run, data in oldColumn that is missing in newColumn gets copied there
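
The recap implies a gate before V3 is rolled out: no row may still differ between the two columns. A minimal sketch of such a check, again assuming Python's standard sqlite3 module; the function names are hypothetical, and `IS NOT` is SQLite's NULL-safe comparison (other databases spell it differently, for example `IS DISTINCT FROM` in PostgreSQL).

```python
import sqlite3


def out_of_sync_count(conn: sqlite3.Connection) -> int:
    """Count rows whose newColumn does not yet match oldColumn."""
    row = conn.execute(
        "SELECT COUNT(*) FROM table1 WHERE newColumn IS NOT oldColumn"
    ).fetchone()
    return row[0]


def safe_to_deploy_v3(conn: sqlite3.Connection) -> bool:
    """Hypothetical gate: V3 may read newColumn only when nothing differs."""
    return out_of_sync_count(conn) == 0


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (userId INTEGER PRIMARY KEY, oldColumn TEXT, newColumn TEXT)"
    )
    # Row written by V2: both columns mirrored.
    conn.execute("INSERT INTO table1 VALUES (1001, 'abcd', 'abcd')")
    # Row written by a lingering V1 instance: newColumn is left NULL.
    conn.execute("INSERT INTO table1 (userId, oldColumn) VALUES (1002, 'xyz')")
    print(safe_to_deploy_v3(conn))  # False until the copy script has run
```

Running the check after the last V1 instance is gone matters, since any V1 instance still writing can reintroduce a difference after the script has finished.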

1 Answer

Your use of terms is a bit confusing, as what you are describing is not "migration" as the term is normally used. Also, it is not clear what requirements you have that you describe as needing zero downtime. Downtime means making something unavailable for a period of time; you can add/drop a column from a table without making that table unavailable to users, so the change itself requires zero downtime - but obviously any query that references a dropped column will no longer work.

If you want to change your DB structure without breaking anything, then either you need control over everything that accesses the DB (which is unlikely to be possible), so that you can deploy the DB change and everything affected by it in one go - or you can protect users from changes by using views that hide the database implementation from them, and only allowing the users to access the views.
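
As a rough illustration of the view approach, here is a sketch in Python with the standard sqlite3 module. The view name table1_view, the helper names, and the choice of SQLite are all assumptions made for the example; the point is only that the readers' query does not change while the physical column behind the view does.

```python
import sqlite3

# Only these physical columns may be placed behind the view (hypothetical whitelist).
ALLOWED_COLUMNS = {"oldColumn", "newColumn"}


def publish_view(conn: sqlite3.Connection, physical_column: str) -> None:
    """(Re)create table1_view so it exposes physical_column under the stable name oldColumn."""
    if physical_column not in ALLOWED_COLUMNS:
        # The name is interpolated into SQL, so refuse anything unexpected.
        raise ValueError(f"unexpected column: {physical_column}")
    conn.executescript(
        f"""
        BEGIN;
        DROP VIEW IF EXISTS table1_view;
        CREATE VIEW table1_view AS
            SELECT userId, {physical_column} AS oldColumn FROM table1;
        COMMIT;
        """
    )


def read_value(conn: sqlite3.Connection, user_id: int):
    # What a user of the database runs: it only ever touches the view.
    row = conn.execute(
        "SELECT oldColumn FROM table1_view WHERE userId = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (userId INTEGER PRIMARY KEY, oldColumn TEXT)")
    conn.execute("INSERT INTO table1 VALUES (1001, 'abcd')")
    publish_view(conn, "oldColumn")
    before = read_value(conn, 1001)

    # Change the implementation behind the view.
    conn.execute("ALTER TABLE table1 ADD COLUMN newColumn TEXT")
    conn.execute("UPDATE table1 SET newColumn = oldColumn")
    publish_view(conn, "newColumn")
    after = read_value(conn, 1001)

    print(before == after)  # True: the reader's query did not change
```

Swapping the view definition inside one transaction means readers see either the old definition or the new one, never a missing view.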

If you make changes that are so fundamental that they cannot be hidden behind a change to a view definition, then you probably have no choice but to communicate the change to your users. They will all need to go through a proper SDLC to determine whether the change affects them, and to update their code if it does.