
Optimal Elasticsearch Index Shards with high Reads and very low data

I am following the AWS documentation on "Choosing the number of shards" for an Elasticsearch index.
The read TPS for my index will be very high (around 1300 TPS, possibly growing to 6500 TPS), but the amount of data in it will be very small (less than 1 GB).

  1. To keep up with the high read rate, I am planning to scale horizontally (increase the number of data nodes).
  2. Because there is so little data, the documentation above implies the number of shards should be 1 (the recommended shard size is roughly 10 GB to 50 GB, and my data is less than 1 GB).

Questions:

  1. As far as I understand, a single shard is not distributed across data nodes (one shard can reside on only one data node). Is this understanding correct?
  2. From what I have read: "In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard." If my understanding above is correct, then with only one shard every request will be single-threaded on a single data node, and horizontal scaling cannot be achieved.
    What is the optimal number of primary shards and replicas for an index with high TPS and little data?
    Should I
    1. still have a single primary shard but multiple replicas (proportional to the number of hosts), or
    2. have multiple primary shards (each only megabytes in size) and a single replica (to save memory)? (I don't expect nodes in my cluster to go down so often that I NEED more than one replica!)
  1. Yes, you are correct. When creating the index, you can set the number of primary shards and replicas (copies) in its settings. Replica shards are essentially clones of your primary shards. They exist for resiliency, but they also improve read performance because they can serve read operations. They can hurt write performance, though, since Elasticsearch must replicate every write to the replicas to keep them up to date. Depending on the number of nodes, you can choose the number of primary and replica shards with resiliency in mind (what happens if a node goes down).
  2. Yes, if you have one shard with zero replicas then, as per the documentation, each query runs on a single thread. That is not necessarily good or bad. Keep in mind that when an index has multiple primary shards, a single request is served by multiple threads (one per shard, each shard holding part of the data), and in the end the partial results must be merged before they are returned to the client. This can hurt performance. Moreover, with only one primary shard, every copy (primary or replica) holds all the data of your index. Different requests can therefore be served by different copies (and so by different threads and nodes), but each request is served by one thread with no merging step, which for megabytes of data is not a bad thing.
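To make the "one thread per query, many copies" reasoning concrete, here is a rough back-of-the-envelope sketch. It assumes one shard copy per node, uses the default size of the Elasticsearch `search` thread pool (`int(cores * 3 / 2) + 1`), and takes a purely hypothetical average query time of 5 ms. The function names are made up for illustration, and the result is an optimistic starting point for a benchmark, not a sizing guarantee (with more threads than cores, CPU-bound queries will not scale this cleanly).

```python
import math


def search_threads(cores: int) -> int:
    """Default size of the Elasticsearch `search` thread pool on one node."""
    return (cores * 3) // 2 + 1


def copies_needed(target_qps: float, avg_query_seconds: float, cores_per_node: int) -> int:
    """Estimate shard copies (1 primary + replicas), one per node, for target_qps.

    Each query occupies one search thread for avg_query_seconds, so a node
    completes at most search_threads / avg_query_seconds queries per second.
    """
    per_node_qps = search_threads(cores_per_node) / avg_query_seconds
    return max(1, math.ceil(target_qps / per_node_qps))


if __name__ == "__main__":
    # 0.005 s per query is an assumed figure; measure your own latency.
    for qps in (1300, 6500):
        copies = copies_needed(qps, avg_query_seconds=0.005, cores_per_node=2)
        print(f"{qps} QPS -> about {copies} shard copies on 2-core nodes")
```

Swap in the query latency you actually measure; the estimate is very sensitive to it.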

Since the data size is small and you need very high throughput, I would opt for 1 primary and as many replicas as the number of nodes minus 1 (the remaining node holds the primary). The right number of nodes depends on your workload, so you'll have to test, but you could start with 3 nodes (a common resilient and performant first setup): 1 primary and 2 replicas in total. Start with that setup and stress test it.
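As a minimal sketch of that layout, the helper below builds the JSON body you would send when creating the index, using the standard `number_of_shards` and `number_of_replicas` index settings. The function name `index_body` is hypothetical, and how you send the body (client library, console, etc.) is left open.

```python
import json


def index_body(data_nodes: int) -> dict:
    """Create-index body: 1 primary shard plus one replica per remaining node."""
    if data_nodes < 1:
        raise ValueError("need at least one data node")
    return {
        "settings": {
            "index": {
                "number_of_shards": 1,                  # all data fits in one shard
                "number_of_replicas": data_nodes - 1,   # one full copy on every other node
            }
        }
    }


if __name__ == "__main__":
    # For the suggested 3-node starting point: 1 primary, 2 replicas.
    print(json.dumps(index_body(3), indent=2))
```

Note that `number_of_shards` is fixed once the index is created, while `number_of_replicas` can be changed later.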

For the stress test you can use Rally, the benchmarking framework that Elasticsearch itself uses when testing new releases.

It's an interesting scenario, and most of the information provided is quite good. I just wanted to add the points below:

  1. As the data size is very small, having multiple primary shards will actually hurt performance, because multiple threads have to be created to query the shards and then gather the results from all of them.
  2. Since we want just 1 primary shard for optimal performance, and a replica of a primary shard can't be allocated on the same data node as the primary, you need other nodes in your cluster for high availability and better read performance (replicas help with both).
  3. A single search query then has to hit just one shard (either the primary or a replica), so Elasticsearch will use just one thread for it.
  4. For better utilization and cost savings, use small data nodes with fewer CPU cores. In this case a 2-core machine seems reasonable (but you can benchmark this).
  5. It's good that you are using AWS Elasticsearch, since you can quickly change the number of replicas and spin up more small data nodes (as explained above) when read traffic grows, and you can even change the number of cores. That lets you scale based on real production traffic and fine-tune further.
  6. You can also change the number of replicas dynamically using the update index settings API, but if CPU utilization on your existing data nodes is already high, make sure to add more data nodes when you do that.
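For the last point, here is a small sketch of what the update index settings call looks like: a `PUT` to `/<index>/_settings` with a new `number_of_replicas`. The endpoint URL and index name below are placeholders, the helper name is hypothetical, and the request is only built, not sent (an AWS domain will usually also need authentication or request signing, which is not shown).

```python
import json
import urllib.request


def replica_update_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index>/_settings request changing the replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder endpoint and index name, for illustration only.
    req = replica_update_request("https://search.example.com", "my-index", 4)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # urllib.request.urlopen(req) would actually send it.
```

After raising the replica count, the new copies stay unassigned until there are enough data nodes to host them, which is another reason to add nodes at the same time.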
