
Amazon ElastiCache Failover

We have been using AWS ElastiCache for about 6 months now without any issues. Every night we have a Java app that runs which flushes DB 0 of our Redis cache and then repopulates it with updated data. However, we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.

We were getting the following exception in our application:

redis.clients.jedis.exceptions.JedisDataException: READONLY You can't write against a read only slave.

When we look at the cache events in ElastiCache we can see:

Failover from master node prod-redis-001 to replica node prod-redis-002 completed

We have not been able to diagnose the issue, and since the app was running fine for the past 6 months I am wondering if it is related to a recent ElastiCache release from June 30: https://aws.amazon.com/releasenotes/Amazon-ElastiCache

We have always been writing to our master node, and we have only 1 replica node.
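One way to ride out a failover like the one described above is to treat a READONLY error as a transient condition and retry the write after a short backoff, so the client reconnects once DNS points at the newly promoted primary. A minimal sketch of that idea follows; the `Supplier` stands in for an actual Jedis call such as `jedis.set(...)`, and all names here are hypothetical, not part of any AWS or Jedis API:

```java
import java.util.function.Supplier;

public class ReadOnlyRetry {
    // Sketch: retry a write when the node answers READONLY (the symptom of
    // writing to a demoted node right after a failover), giving the client
    // time to reconnect to the new primary. Not production-hardened.
    static <T> T withFailoverRetry(Supplier<T> write, int maxAttempts, long backoffMillis) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return write.get();
            } catch (RuntimeException e) {
                String msg = e.getMessage();
                if (msg == null || !msg.contains("READONLY")) {
                    throw e; // not a failover symptom; surface it immediately
                }
                last = e;
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last != null ? last : new IllegalStateException("maxAttempts must be > 0");
    }

    public static void main(String[] args) {
        // Trivial demonstration with a write that succeeds immediately.
        System.out.println(withFailoverRetry(() -> "OK", 3, 100));
    }
}
```

With ElastiCache it also matters which hostname the app connects to: the replication group's primary endpoint follows the failover, whereas a hard-coded node endpoint keeps pointing at the demoted node.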

If someone could offer any insight it would be much appreciated.

EDIT: This seems to be an intermittent problem. Some days it fails; other days it runs fine.

We have been in contact with AWS support for the past few weeks, and this is what we have found.

Most Redis requests are synchronous, including the flush, so it will block all other requests. In our case we are actually flushing 19 million keys, and it takes more than 30 seconds.

ElastiCache performs a health check periodically, and since the flush is running, the health check is blocked, thus causing a failover.
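The blocking behaviour can be illustrated with a toy model (not Redis code, just arithmetic): Redis executes commands one at a time on a single thread, so a health-check PING that arrives while FLUSHDB is running cannot be answered until the flush finishes.

```java
public class SingleThreadModel {
    // Toy model of a single command thread: the flush starts at t = 0 and
    // runs to completion; a PING arriving mid-flush waits until it ends.
    static long pingResponseDelayMillis(long flushDurationMillis, long pingArrivalMillis) {
        long pingStarts = Math.max(pingArrivalMillis, flushDurationMillis);
        return pingStarts - pingArrivalMillis; // how long the PING waited
    }

    public static void main(String[] args) {
        // A 35 s flush delays a PING arriving 1 s in by 34 s -- far beyond
        // any plausible health-check timeout, so a failover is triggered.
        System.out.println(pingResponseDelayMillis(35_000, 1_000));
    }
}
```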

We have been asking the support team how often the health check is performed, so we can understand why our flush causes a failover only 3-4 times a week. The best answer we could get is "We think it's every 30 seconds." However, our flush consistently takes more than 30 seconds, yet it doesn't consistently fail.

They said that they may implement the ability to configure the timing of the health check, but that this would not be done anytime soon.

The best advice they could give us is:

1) Create a completely new cluster for loading the new data onto, and instead of flushing the previous cluster, re-point your application(s) to the new cluster and remove the old one.

2) If the data that you are flushing is an updated version of the existing data, consider not flushing at all, and instead updating and overwriting the keys.

3) Instead of flushing the data, set the expiry of the items to when you would normally flush, and let the keys be reclaimed (possibly with a random offset to avoid thundering-herd issues), and then reload the data.
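Option 3 above can be sketched in a few lines: instead of one big FLUSHDB, each key gets a TTL near the nightly reload time plus a random jitter, so millions of keys don't all expire in the same instant. The key name and jitter window below are hypothetical, chosen only for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredExpiry {
    // Sketch: base TTL plus a uniform random offset in [0, jitterSeconds],
    // spreading expirations out to avoid a thundering herd at reload time.
    static int ttlWithJitter(int baseSeconds, int jitterSeconds) {
        return baseSeconds + ThreadLocalRandom.current().nextInt(jitterSeconds + 1);
    }

    public static void main(String[] args) {
        // e.g. expire roughly 24 h from now, within a 10-minute window.
        // With Jedis this TTL would be applied per key, along the lines of:
        //   jedis.setex("item:42", ttlWithJitter(24 * 3600, 600), payload);
        System.out.println(ttlWithJitter(24 * 3600, 600));
    }
}
```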

Hope this helps :)

Currently, for Redis versions from 6.2, AWS ElastiCache has a new thread-monitoring feature, so the health check no longer happens on the same thread as all other Redis actions. Redis can continue processing a long command or Lua script and still be considered healthy. Because of this new feature, failovers should happen less often.


 