
Is it normal to lose 30 minutes of data on a SQL Server crash?

Original post: 2022-04-29 13:13:32. Tags: sql, sql-server, database, amazon-web-services, database-replication

We have worked with Oracle for a number of years and we now need to develop a new application using SQL Server in AWS for the first time.

What surprised us is that the new SQL Server DBA told us right off the bat that SQL Server uses some kind of replication every 30 minutes (or with a 30-minute delay, I don't remember). In short, he said that if the AWS SQL Server crashes, by the time the secondary server comes up "we will lose only 30 minutes of data in Production".

EDIT: By "crash", we mean the primary server is dead/unrecoverable.

We never expected this to be normal behavior, and we have never seen anything like it in Oracle... ever.

Is it normal to expect to lose 30 minutes of commits on a SQL Server crash? This would include payments, invoices, and other transactions that we would consider quite important.

Should I push back on this, or is this considered normal in SQL Server?

1 Answer

Yes, if that's what you pay for or have set up.

For Oracle, I'm pretty sure that you did not ask. Clients always assume RTO and RPO are 0. When you explain what RTO and RPO are, they request 0 for RTO and RPO. Then, when you explain the complexity and cost, the client is likely okay with 24 hours, while we do our best to hit our default of 15 minutes. Take a look at:

https://www.brentozar.com/archive/2017/10/sql-server-architecture-review/

I love Brent's response to one of the questions:

"For Infrastructure-as-a-Service, there are no changes. For Platform-as-a-Service, there's no chart – just put in your credit card and turn the knob to the level of protection that you want."

So, what's in your wallet?

Here is one definition of RTO and RPO, from the AWS Well-Architected documentation:

Recovery Time Objective (RTO) is...the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.

Recovery Point Objective (RPO) is... the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
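To make the RPO definition concrete for the question's scenario, here is a minimal Python sketch. It assumes the only recovery points are periodic syncs (log shipping, replication, or backups) taken at a fixed interval; the function names are hypothetical and not part of any SQL Server or AWS API.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(sync_interval: timedelta) -> timedelta:
    """Worst case: the primary dies just before the next sync would run,
    so everything committed since the previous sync is lost."""
    return sync_interval


def actual_data_loss(last_recovery_point: datetime, crash_time: datetime) -> timedelta:
    """Data written between the last recovery point and the crash is lost."""
    return crash_time - last_recovery_point


def meets_rpo(sync_interval: timedelta, rpo: timedelta) -> bool:
    """A setup meets the RPO only if its worst-case loss fits inside it."""
    return worst_case_data_loss(sync_interval) <= rpo


# The DBA's setup: a recovery point every 30 minutes.
every_30_min = timedelta(minutes=30)

# If the business asks for at most 5 minutes of lost commits, this fails.
print(meets_rpo(every_30_min, rpo=timedelta(minutes=5)))   # False
# If the business signed off on 30 minutes, it passes.
print(meets_rpo(every_30_min, rpo=timedelta(minutes=30)))  # True

# A crash at 13:29 with the last sync at 13:00 loses 29 minutes of commits.
lost = actual_data_loss(datetime(2022, 4, 29, 13, 0), datetime(2022, 4, 29, 13, 29))
print(lost)  # 0:29:00
```

The point of the sketch is that "30 minutes of loss" is not a property of SQL Server; it falls out of whatever sync interval was configured or paid for.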

The architecture or features determine what RPO and RTO are possible. For a platform as a service, you do not define the implementation. If you pay for near-zero RTO and RPO, then I'd expect something more than a single instance for HA and backup/restore for DR.

The following link might help explain some of this. It's a complex subject. The complexity itself can be a risk for RTO and RPO. For example, our first SQL cluster did not fail over correctly because some resources were not configured correctly as cluster resources... we lost a 9 for that. I like this link because of the image that shows how each feature relates to RTO and RPO.

https://techcommunity.microsoft.com/t5/core-infrastructure-and-security/an-overview-of-high-availability-and-disaster-recovery-solutions/ba-p/370479

The following link indicates that an Azure instance or database has an RTO of 12h and an RPO of 1h when using geo-restore. You can use auto-failover to improve this to 1h and 5s, respectively. It's all very vendor-specific.

https://learn.microsoft.com/en-us/azure/azure-sql/database/business-continuity-high-availability-disaster-recover-hadr-overview?view=azuresql

This link shows that paying more allows the use of features and resources that provide a better RTO and RPO. I would expect "Hyperscale service tier zone redundant availability" to drain the wallet.

https://learn.microsoft.com/en-us/azure/azure-sql/database/high-availability-sla?view=azuresql&tabs=azure-powershell

If you are doing it yourself, selecting the right features is not enough. Having an old SAN go down for 3 days (yes) can be a real show stopper, no matter what features are used.
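For a sense of scale on "losing a 9", here is a rough Python sketch of the usual availability arithmetic. It assumes a 365-day year and counts every outage hour as downtime; the helper names are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for a given number of nines (3 means 99.9%)."""
    return HOURS_PER_YEAR * 10 ** (-nines)


def nines_achieved(downtime_hours: float) -> int:
    """Whole nines of availability left after this much downtime in a year.
    Requires downtime_hours > 0."""
    unavailable_fraction = downtime_hours / HOURS_PER_YEAR
    return math.floor(-math.log10(unavailable_fraction))


# Each extra nine cuts the yearly downtime budget by a factor of ten.
for n in (2, 3, 4):
    print(n, "nines allows about", round(allowed_downtime_hours(n), 2), "hours per year")

# A SAN that is down for 3 days (72 hours) leaves roughly 99.18% availability.
print(nines_achieved(72))  # 2
```

So a single 3-day outage caps the whole year at two nines, regardless of which HA features were switched on.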