
RDS Multi-AZ bottlenecking write performance

We are using an RDS MySQL 5.6 instance (db.m3.2xlarge) in the sa-east-1 region, and during write-intensive operations we can see in CloudWatch that both our Write Throughput and our Network Transmit Throughput are capped at 60 MB/s.
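For reference, here is a minimal boto3 sketch (with a hypothetical instance identifier) of how the same two CloudWatch metrics can be pulled programmatically; both are reported in bytes/second:

    import boto3
    from datetime import datetime, timedelta

    cw = boto3.client("cloudwatch", region_name="sa-east-1")
    for metric in ("WriteThroughput", "NetworkTransmitThroughput"):
        stats = cw.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName=metric,
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
            StartTime=datetime.utcnow() - timedelta(hours=1),
            EndTime=datetime.utcnow(),
            Period=60,
            Statistics=["Maximum"],
        )
        # Convert the peak from bytes/second to MB/s to compare against the cap.
        peak = max((p["Maximum"] for p in stats["Datapoints"]), default=0)
        print(metric, round(peak / 1000 / 1000, 1), "MB/s")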

We suspected that Multi-AZ could be responsible for this behaviour and turned it off for testing purposes. We ran the same operation and noticed that the Write Throughput was no longer capped and that the Network Transmit Throughput was effectively zero. This reinforced the idea that the network traffic is replication traffic between the primary instance and the standby instance in the Multi-AZ setup.
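Toggling Multi-AZ for a test like this is a single API call; a minimal boto3 sketch, again with a hypothetical instance identifier:

    import boto3

    rds = boto3.client("rds", region_name="sa-east-1")
    rds.modify_db_instance(
        DBInstanceIdentifier="my-db-instance",  # hypothetical identifier
        MultiAZ=False,           # drop the synchronous standby for the test
        ApplyImmediately=True,   # apply now rather than at the next maintenance window
    )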

Here is the CloudWatch chart showing the operation without Multi-AZ, followed immediately by the same operation with Multi-AZ enabled:


We tried upgrading the instance to one with the highest network performance and also tried Provisioned IOPS, but there was no change: when Multi-AZ is on, writes are always capped at 60 MB/s.
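The instance class and Provisioned IOPS changes can be applied with the same kind of modify call; a rough sketch with hypothetical values:

    import boto3

    rds = boto3.client("rds", region_name="sa-east-1")
    rds.modify_db_instance(
        DBInstanceIdentifier="my-db-instance",  # hypothetical identifier
        DBInstanceClass="db.m3.2xlarge",        # hypothetical target class
        Iops=10000,                             # hypothetical Provisioned IOPS value
        ApplyImmediately=True,
    )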

It's our understanding that Multi-AZ uses synchronous data replication, but we can't find any information on the bandwidth limits of the link through which this replication occurs. Does anyone know anything about this and how to avoid these limits? Or should we just live with it?

I don't think you're seeing a limitation of the replication service per se; rather, it appears that your replication bandwidth shares the same transport as the EBS volume on your instance, so it's a limitation of the Ethernet bandwidth available to the instance itself (remember that EBS is network-attached storage).

The network connection on an m3.2xlarge is 1000 Mbit/s, which works out to 125 MB/s.

Divide that number by two and you get ~60 MB/s for writing to the local instance's EBS volume and another ~60 MB/s for writing to the synchronous replica.
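As a back-of-the-envelope check of that arithmetic (the 1 Gbit/s figure is the commonly quoted value for the m3.2xlarge, not an AWS-published hard limit):

    # Rough check of the ~60 MB/s ceiling under the shared-transport theory.
    link_mbit_per_s = 1000                 # assumed m3.2xlarge network bandwidth
    link_mb_per_s = link_mbit_per_s / 8.0  # 125 MB/s total
    per_stream = link_mb_per_s / 2         # split between local EBS writes and replication
    print(per_stream)                      # 62.5 MB/s, right around the observed cap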

Unfortunately, AWS has not publicly explained the implementation details of Multi-AZ replication in enough detail to say conclusively that this is the explanation, but the numbers are suspiciously close to what this theory would predict.

The m3 family and m4 family of instances have similar specs but also (apparently) some fundamental design differences, so it might be informative to see if the same behavior is true of the m4.2xlarge.

I have experienced the same issue; after activating Multi-AZ, the Write Latency increased dramatically:

(CloudWatch chart: Write Latency before and after enabling Multi-AZ)

(The instance type is m4.4xlarge)

The reason looks to be the synchronous replication process: each write has to wait until both DB instances have acknowledged the modification.
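A toy model of why that hurts latency (an illustration of synchronous commit in general, not AWS's actual implementation; all numbers are made up):

    # A commit is acknowledged only after BOTH the primary's local write and the
    # standby's write (which also pays a cross-AZ round trip) have completed.
    def commit_latency_ms(local_write_ms, standby_write_ms=None, cross_az_rtt_ms=1.0):
        if standby_write_ms is None:                            # Single-AZ case
            return local_write_ms
        return max(local_write_ms, cross_az_rtt_ms + standby_write_ms)

    print(commit_latency_ms(2.0))            # Single-AZ: 2.0 ms
    print(commit_latency_ms(2.0, 2.0, 1.5))  # Multi-AZ:  3.5 ms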

It looks like there is no solution and this is expected behaviour:

DB instances using Multi-AZ deployments may have increased write and commit latency compared to a Single-AZ deployment, due to the synchronous data replication that occurs

from AWS documentation

Here is an interesting Reddit thread regarding this:

the only recommendation I see is moving to Aurora :/

Well, I never got an ACTUAL explanation from anywhere, but after tons of tests it seems that the m3.2xlarge is actually "bugged". I wrote a detailed explanation on my blog.
