
Optimal Elasticsearch Index Shards with high Reads and very low data

Original post: 2020-04-07 21:25:12 | Tags: amazon-web-services, elasticsearch, aws-elasticsearch

I am following the AWS documentation on "Choosing the number of shards" for an Elasticsearch index.
The read TPS for my index will be very high (around 1300 TPS, possibly rising to 6500 TPS), but the amount of data will be very small (less than 1 GB).

To handle the high read load, I am planning to scale horizontally (increase the number of data nodes).
Because the data is so small, the above documentation suggests the number of shards should be 1 (the recommended shard size is roughly 10 GB to 50 GB, and my data is under 1 GB).
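
The sizing reasoning above can be sketched as a small calculation. This is only a rough illustration, assuming the rule-of-thumb formula from the AWS sizing guide, (source data + room to grow) * (1 + indexing overhead) / desired shard size. The function name, the 10% overhead, and the 30 GB target are assumptions chosen for the example, not values taken from the question.

```python
import math


def approx_primary_shards(source_gb, growth_gb=0.0,
                          indexing_overhead=0.10, target_shard_gb=30.0):
    """Rough primary-shard estimate; all default values are illustrative assumptions."""
    # Total on-disk size after accounting for growth and indexing overhead.
    total_gb = (source_gb + growth_gb) * (1 + indexing_overhead)
    # An index always has at least one primary shard.
    return max(1, math.ceil(total_gb / target_shard_gb))


# With under 1 GB of data, the estimate bottoms out at a single primary shard.
print(approx_primary_shards(1.0))
```

For any data set far below the target shard size, the estimate stays at one primary shard, which is the conclusion drawn above.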

Questions:

As far as I understand, one shard is not distributed over data nodes (one shard can reside on only one data node). Is this understanding correct?
From here: "In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard." If the above understanding is correct and I only have one shard, all requests will be single-threaded on a single data node, so horizontal scaling cannot be achieved.
What is the optimal number of primary shards and replicas for an index, given the high TPS and low data volume?
Should I
1. still have a single shard, but multiple replicas (proportional to the number of hosts), or
2. have multiple primary shards (each only megabytes in size) and a single replica (to save memory)? (I don't expect nodes in my cluster to fail so badly that I NEED more than one replica!)

2 Answers

Yes, you are correct. When creating your index, you can set the number of primary shards and the number of replicas (copies). Replica shards are essentially clones of your primary shards: they are there for resiliency, but they also benefit read performance, since they can serve read operations. They can harm write performance, though, because Elasticsearch has to replicate the data across nodes to keep the copies up to date. Depending on the number of nodes, you can decide the number of primary and replica shards with resiliency in mind (what happens if a node goes down).
Yes, if you have one shard with zero replicas then, as per the documentation, each query runs in a single thread. That is not necessarily bad or good. Keep in mind that when a single request is served by multiple threads (multiple shards each holding part of the data), the partial results have to be merged before they are returned to the client, and this can harm performance. Moreover, if you have only one primary shard, then even with replicas every shard copy (primary or replica) holds all the data of your index. This means that different requests can be served by any shard copy (and thus by different threads), but each request is served by one thread, with no merging needed, which for megabytes of data is not a bad thing.
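
To make the fan-out argument concrete, here is a back-of-the-envelope sketch. It assumes that every search touches one copy (primary or replica) of each primary shard and that load spreads evenly across data nodes; it ignores caching, routing, and replica selection, and the function names are made up for this illustration.

```python
def shard_searches_per_second(query_tps, primary_shards):
    """Shard-level search tasks per second generated by a given query rate."""
    # Each query fans out to one copy of every primary shard.
    return query_tps * primary_shards


def per_node_shard_searches(query_tps, primary_shards, data_nodes):
    """Shard-level tasks per node, assuming a perfectly even spread (an idealization)."""
    return shard_searches_per_second(query_tps, primary_shards) / data_nodes


# 1300 queries/s against 1 primary shard vs. 5 primary shards.
print(shard_searches_per_second(1300, 1))
print(shard_searches_per_second(1300, 5))
```

Under these assumptions, splitting a tiny index into several primaries multiplies the shard-level work per query and adds a merge step, whereas adding replicas only adds more copies that can each answer a whole query.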

Since the data size is small and you need very high throughput, I would opt for 1 primary shard and as many replicas as the number of nodes minus 1 (the remaining node holds the primary). The right number of nodes depends on your workload. You will have to test, but you could start with 3 nodes (a common resilient and performant first setup), so 1 primary and 2 replicas in total. Start with that setup and stress test it.
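
As a minimal sketch of that layout, the snippet below builds the settings body you might send when creating the index. The helper name is hypothetical; `number_of_shards` and `number_of_replicas` are the standard index settings, and the node count is whatever your own cluster has.

```python
import json


def index_settings_for(data_nodes):
    """Settings body for 1 primary shard plus one replica on every other data node."""
    if data_nodes < 1:
        raise ValueError("need at least one data node")
    return {
        "settings": {
            "index": {
                "number_of_shards": 1,                 # single primary: no cross-shard merge
                "number_of_replicas": data_nodes - 1,  # one full copy on each remaining node
            }
        }
    }


# Example for the suggested 3-node starting point.
print(json.dumps(index_settings_for(3), indent=2))
```

With 3 data nodes this yields 1 primary and 2 replicas, so every node holds a full copy of the index and can answer any query on its own.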

For the stress test you can use Rally, which is the framework Elasticsearch uses to benchmark new releases.

It's an interesting scenario, and most of the information already provided is quite good. I just wanted to add the points below:

As the data size is very small, having multiple primary shards will actually hurt performance, because multiple threads have to be created to query the shards and gather the results from all of them.
Since you need just 1 primary shard for optimal performance, and a replica cannot be allocated on the same physical data node as its primary, you need additional nodes in your cluster, both for high availability and for better read performance (replicas help with both).
A single search query then has to hit just one shard (either the primary or a replica), so Elasticsearch will use just one thread for it.
For better utilization and cost savings, use small data nodes with fewer CPU cores. In this case a 2-core machine seems reasonable (but you should benchmark this).
It's good that you are using AWS Elasticsearch: you can quickly change the number of replicas and spin up more small data nodes (as explained above) when read traffic grows, and you can even change the number of cores. That gives you a better scaling option, which you can fine-tune further based on production traffic.
You can also change the number of replicas dynamically using the update index settings API, but make sure to add more data nodes when you do that if CPU utilization on your existing data nodes is already high.
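
The replica change mentioned in the last point could look roughly like this. The sketch only builds the `PUT /<index>/_settings` request and does not send it; the endpoint URL and index name are placeholders, and a real AWS domain will usually also need authentication or request signing, which is left out here.

```python
import json
import urllib.request


def build_replica_update_request(base_url, index_name, replicas):
    """Build (but do not send) an update-index-settings request for the replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder endpoint and index name, for illustration only.
req = build_replica_update_request("http://localhost:9200", "my-index", 4)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Raising the replica count only helps if there are enough data nodes to host the new copies, since a replica is never placed on the same node as another copy of the same shard.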