
Amazon Aurora DB Cluster Not Auto Balancing Correctly

Posted 2017-11-07 23:58:05 | Tags: mysql, database, amazon-web-services, amazon-rds, amazon-rds-aurora

I have created an Amazon Aurora database cluster running MySQL with three instances: the main instance that backs the cluster and two read replicas for balancing. However, the cluster does not seem to be balancing the reads at all. One replica is handling 700+ selects/sec, with its CPU pegged at 99.75% or higher, while the other replica is doing virtually nothing: 4% CPU at 1 select per second, if that. The main cluster instance itself is at 33% CPU usage, as it is being written to while the replicas are being read from. The lag time between the replicas is under 20 milliseconds. My application is querying the read-only endpoint of the cluster, but no balancing appears to be happening. Does anyone have any insight into why this may be happening or why the replica is at such a high CPU usage? The queries being run against it are not complex by any means.

4 Answers

Aurora cluster endpoints are DNS records, and they only do DNS round robin during resolution. This means that when your client application opens connections to a cluster endpoint, the endpoint resolves to different instances (different IPs, basically), thereby striping your connections across multiple replicas. Past that point, there is no load balancing. Connections are striped across instances, and the queries run on each connection go to the instance backing it.
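To make the connection-level behaviour concrete, here is a toy model in Python. It does not talk to AWS; the IPs and per-connection query counts are made up for illustration, and DNS round robin is approximated as a simple rotation. It shows how evenly striped connections can still produce very uneven replica load when one connection carries most of the queries.

```python
from collections import Counter
from itertools import cycle


def open_connections(replica_ips, count):
    """Model DNS round robin: each new connection resolves to the next replica IP."""
    rotation = cycle(replica_ips)
    return [next(rotation) for _ in range(count)]


def replica_load(connections, queries_per_connection):
    """Sum queries per replica; every query stays on the replica its connection landed on."""
    load = Counter()
    for ip, queries in zip(connections, queries_per_connection):
        load[ip] += queries
    return load


replicas = ["10.0.0.11", "10.0.0.12"]        # hypothetical replica IPs
connections = open_connections(replicas, 4)  # striped evenly: 2 connections per replica
# Hypothetical workload: one busy connection, three nearly idle ones.
load = replica_load(connections, [700, 1, 0, 0])
print(dict(load))
```

In this model both replicas hold two connections each, yet almost all queries land on the replica behind the busy connection.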

Now consider the scenario where your connection pool was already created against the cluster endpoint when there was only one instance behind it. If you then add more instances, there is no impact on your application unless you terminate your connections and re-establish them. Doing so performs the DNS round robin again, and this time some of your connections land on the new instances that you provisioned.
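The stale-pool scenario can be sketched the same way. `ToyPool` and `round_robin_resolver` below are hypothetical stand-ins, not a real pool or resolver: a "connection" is just the IP it resolved to when it was opened.

```python
from itertools import cycle


class ToyPool:
    """Hypothetical pool: each connection remembers the IP it resolved to at open time."""

    def __init__(self, size):
        self.size = size
        self.connections = []

    def connect(self, resolve):
        # Drop any existing connections and open fresh ones,
        # resolving the endpoint once per new connection.
        self.connections = [resolve() for _ in range(self.size)]


def round_robin_resolver(ips):
    """Stand-in for DNS round robin over the given instance IPs."""
    rotation = cycle(ips)
    return lambda: next(rotation)


pool = ToyPool(size=4)
pool.connect(round_robin_resolver(["10.0.0.11"]))  # only one instance exists
before = list(pool.connections)

# A second replica is added, but the pool is not touched: nothing changes.
after_scale = list(pool.connections)

# Only after the connections are re-established does the new replica get traffic.
pool.connect(round_robin_resolver(["10.0.0.11", "10.0.0.12"]))
after_reconnect = list(pool.connections)
print(before, after_scale, after_reconnect)
```

With a real pool, the equivalent of `connect` is usually a maximum connection lifetime or an explicit pool refresh, depending on what the pool library offers.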

A few callouts:

In Aurora, you have two cluster endpoints. One (RW) endpoint always points to the current writer, and one (RO) does the DNS round robin between your read replicas.

Also, DNS propagation might take a few seconds when failovers happen, so occasional errors during a failover are to be expected.

Hope this helps.

My guess is that you are not connecting to the cluster endpoint.

Load Balancing – Connecting to the cluster endpoint allows Aurora to load-balance connections across the replicas in the DB cluster. This helps to spread the read workload around and can lead to better performance and more equitable use of the resources available to each replica. In the event of a failover, if the replica that you are connected to is promoted to the primary instance, the connection will be dropped. You can then reconnect to the reader endpoint in order to send your read queries to the other replicas in the cluster.

New Reader Endpoint for Amazon Aurora – Load Balancing & Higher Availability

[EDIT]

To load balance within a single application, you will need to reconnect to the endpoint. If you use the same connection for all queries, only one replica will respond. However, opening connections is expensive, so this might not provide much benefit unless your queries take some time to run.

We've implemented a driver to try to mitigate this problem, with some visible gains: https://github.com/DiceTechnology/dice-fairlink

It regularly discovers the read replicas to keep up with cluster changes, and it round-robins connections among them.
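The general idea (periodic discovery plus client-side round robin) can be sketched in a few lines of Python. This is not dice-fairlink's API or implementation; `RoundRobinReplicaPicker`, the `discover` callback, and the refresh interval are all hypothetical names chosen for illustration.

```python
import time


class RoundRobinReplicaPicker:
    """Hypothetical helper: rotate over replicas, re-discovering them periodically."""

    def __init__(self, discover, refresh_seconds=30.0, clock=time.monotonic):
        self._discover = discover            # callable returning current replica hosts
        self._refresh_seconds = refresh_seconds
        self._clock = clock                  # injectable so the demo can fake time
        self._replicas = []
        self._last_refresh = None
        self._index = 0

    def _refresh_if_due(self):
        now = self._clock()
        if self._last_refresh is None or now - self._last_refresh >= self._refresh_seconds:
            self._replicas = list(self._discover())  # snapshot the current membership
            self._last_refresh = now

    def next_replica(self):
        """Return the host the next new connection should be opened against."""
        self._refresh_if_due()
        if not self._replicas:
            raise RuntimeError("no replicas discovered")
        host = self._replicas[self._index % len(self._replicas)]
        self._index += 1
        return host


cluster = ["replica-1", "replica-2"]  # hypothetical replica host names
fake_now = [0.0]
picker = RoundRobinReplicaPicker(lambda: cluster, clock=lambda: fake_now[0])

first = [picker.next_replica() for _ in range(4)]
cluster.append("replica-3")           # cluster grows
fake_now[0] = 31.0                    # refresh interval has elapsed
second = [picker.next_replica() for _ in range(3)]
print(first, second)
```

In a real application the `discover` callback would query whatever source of truth you trust for cluster membership, and each new connection would be opened against the host returned by `next_replica`.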

Although we did not measure CPU utilisation, we observed a better load distribution than with the native DNS-based round robin of the cluster reader endpoint.

Aurora's DNS-based load balancing works at the connection level (not the individual query level). You must keep resolving the endpoint, without caching DNS, to get a different instance IP on each resolution. If you only resolve the endpoint once and then keep the connection in your pool, every query on that connection goes to the same instance. If you cache DNS, you receive the same instance IP each time you resolve the endpoint.
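One rough way to check what your client actually sees is to resolve the reader endpoint several times and count the distinct IPs. The sketch below uses Python's standard `socket.getaddrinfo`; the port (3306), the `cached` wrapper, and the fake resolver used in the demo are assumptions for illustration, so the demo runs without network access.

```python
import socket
from collections import Counter
from itertools import cycle


def resolve_ipv4(hostname):
    """Resolve once through the OS resolver and return the first IPv4 address."""
    infos = socket.getaddrinfo(hostname, 3306, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return infos[0][4][0]  # sockaddr is (address, port) for AF_INET


def sample_resolutions(hostname, attempts, resolve=resolve_ipv4):
    """Resolve the name repeatedly and count how often each IP comes back."""
    return Counter(resolve(hostname) for _ in range(attempts))


def cached(resolve):
    """Model a client-side DNS cache that never expires."""
    cache = {}

    def wrapper(hostname):
        if hostname not in cache:
            cache[hostname] = resolve(hostname)
        return cache[hostname]

    return wrapper


# Offline demo with a fake round-robin resolver (hypothetical IPs and host name).
fake_dns = cycle(["10.0.0.11", "10.0.0.12"])


def fake_resolve(hostname):
    return next(fake_dns)


uncached = sample_resolutions("reader.example.internal", 6, resolve=fake_resolve)
pinned = sample_resolutions("reader.example.internal", 6, resolve=cached(fake_resolve))
print(dict(uncached), dict(pinned))
```

Against a real cluster you would call `sample_resolutions` with your own reader endpoint and the default resolver, pausing between attempts so that any short-lived cache entries can expire. If only one IP ever comes back, some layer between your code and DNS is likely caching the answer.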

Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short time-to-live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL. Remember that DNS caching can occur anywhere from your network layer, through the operating system, to the application container. For example, Java virtual machines (JVMs) are notorious for caching DNS indefinitely unless configured otherwise. See the AWS documentation and the Aurora whitepaper on configuring the DNS cache TTL.