简体   繁体   中英

Amazon Aurora DB Cluster Not Auto Balancing Correctly

I have created an Amazon Aurora Database cluster runing MySQL with three instances: the main instance that backs the cluster and two read replicas for balancing. However, the cluster does not seem to be balancing the reads at all. I have one replica managing 700+ Selects/sec maximizing the CPU at 99.75% or higher while the other replica is doing virtually nothing with a CPU usage of 4% at 1 select per second, if that. The main cluster instance itself is at 33% CPU usage as it is being written to simultaneously while the replicas should are being read from. The lag time between the replicas is under 20 milliseconds. My application is querying the read only endpoint of the cluster but no balancing appears to be happening. Does anyone have any insight into why this may be happening or why the replica is at such a high CPU usage? The queries being ran against it are not complex by any means.

Aurora Cluster endpoints are DNS records and they only do DNS round robin during resolution. This means that when your client application opens connections to a cluster endpoint, you end up resolving the endpoint to different instances (different IPs basically), there by striping your connections across multiple replicas. Past that point, there is no load balancing. Connections are striped across instances, and queries run on each of those connections go to the corresponding instance backing it.

Now consider the scenario where your connection pool was already created to the cluster endpoint when you have one instance behind it. Now, if you add more instances, there will be no impact to your application, unless you terminate your connection and reestablish them. You would do a DNS round robin again, and this time some of your connections would land on the new instance that you provisioned.

Few callouts:

In Aurora, you have 2 cluster endpoints. One (RW) endpoint always points to the current writer and one (RO) does the DNS round robin between your read replicas.

Also, DNS propagation might take a few seconds when failovers happen, so that occasional errors are quite natural when failovers occur.

Hope this helps.

My guess is that you are not connecting to the cluster endpoint.

Load Balancing – Connecting to the cluster endpoint allows Aurora to load-balance connections across the replicas in the DB cluster. This helps to spread the read workload around and can lead to better performance and more equitable use of the resources available to each replica. In the event of a failover, if the replica that you are connected to is promoted to the primary instance, the connection will be dropped. You can then reconnect to the reader endpoint in order to send your read queries to the other replicas in the cluster.

New Reader Endpoint for Amazon Aurora – Load Balancing & Higher Availability

[EDIT]

To load balance within a single application, you will need to reconnect to the endpoint. If you use the same connection for all queries only one replica will be responding. However, opening connections is expensive so this might not provide much benefit unless your queries take some time to run.

We've implemented a driver to try to mitigate this problem, with some visible gains: https://github.com/DiceTechnology/dice-fairlink

It regularly discovers the read-replicas to catch up with cluster changes and round-robins connections among them.

Despite not measuring any CPU utilisation, we've observed a better load distribution than with the native DNS based round-robin of the cluster reader endpoint

The Aurora's DNS based load balancing works at the connection level (not the individual query level). You must keep resolving the endpoint without caching DNS to get a different instance IP on each resolution. If you only resolve the endpoint once and then keep the connection in your pool, every query on that connection goes to the same instance. If you cache DNS, you receive the same instance IP each time you resolve the endpoint.

Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL. Remember that DNS caching can occur anywhere from your network layer, through the operating system, to the application container. For example, Java virtual machines (JVMs) are notorious for caching DNS indefinitely unless configured otherwise. Here are AWS documentation and Aurora whitepaper on configuring DNS cache ttl.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM