
Elasticsearch percolation dead slow on AWS EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :(

We use Elasticsearch 2.2.0. To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60GB RAM, 2 x 160 GB SSD) and tell our index to include only this node in allocation. Because we'll distribute it amongst the rest of the nodes later, we use 10 shards and no replicas (just for indexing and percolation). There are about 22 million documents in the index and 15,000 percolators. The index is a tad smaller than 11GB (and so easily fits into memory). About 16 php processes talk to the REST API doing multi percolate requests with 200 percolate requests in each (we made the batches smaller for performance reasons; it was 1000 per request before).
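For context, the allocation pinning and the multi percolate batches look roughly like the sketch below; the host, index name, type name and node name ("myindex", "mytype", "percolate-node-1") are placeholders, not our real ones.

    <?php
    // Minimal sketch of the setup described above (placeholder names throughout).
    function esRequest($method, $path, $body = null) {
        $ch = curl_init('http://localhost:9200' . $path);
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $method);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        if ($body !== null) {
            curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
        }
        $result = curl_exec($ch);
        curl_close($ch);
        return $result;
    }

    // Keep all 10 shards of the index on the dedicated indexing/percolation node.
    esRequest('PUT', '/myindex/_settings', json_encode([
        'index.routing.allocation.include._name' => 'percolate-node-1',
    ]));

    // One multi percolate request with 200 documents. The ES 2.x _mpercolate API
    // takes newline-delimited JSON: a header line followed by a document line.
    $documents = [['message' => 'example text to percolate'] /* ... 200 docs ... */];
    $lines = [];
    foreach ($documents as $doc) {
        $lines[] = json_encode(['percolate' => ['index' => 'myindex', 'type' => 'mytype']]);
        $lines[] = json_encode(['doc' => $doc]);
    }
    echo esRequest('POST', '/_mpercolate', implode("\n", $lines) . "\n");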

One percolation request (a real one, tapped off the running php processes) takes around 2m20s under the load of the 16 php processes. That would have been OK if one of the resources on the EC2 instance were maxed out, but that's the strange thing (seen in the node stats output as well as in htop, iotop and iostat): load, CPU, memory, heap, IO; everything is well (very well) within limits. There doesn't seem to be a shortage of resources, but still percolation performance is bad.
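This is the kind of check that produced those stats; a sketch assuming the node answers on localhost:9200, using the standard ES 2.x node stats and hot threads endpoints.

    <?php
    // Sketch: the monitoring endpoints we watched while the php workers were running.
    foreach (['/_nodes/stats/os,process,jvm,thread_pool', '/_nodes/hot_threads'] as $path) {
        $ch = curl_init('http://localhost:9200' . $path);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        echo curl_exec($ch), "\n";   // load/CPU/heap plus thread pool queues, and hot threads
        curl_close($ch);
    }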

When we back off the php processes and try the percolate request again, it comes in at around 15s. Just to be clear: I don't have a problem with a 2min+ multi percolate request, as long as I know that one of the resources is fully utilized (so I can act on it by giving it more of what it wants).

So, ok, it's not the usual suspects, let's try different stuff:

  • To rule out network, coordination, etc. issues, we also did the same request from the node itself (enabling the client on it) with the same pressure from the php processes: no change.
  • We upped the processors setting in elasticsearch.yml and restarted the node to fake our way to a higher usage of resources (see the settings sketch after this list): no change.
  • We tried tweaking the percolate and get thread pool sizes and queue sizes: no change.
  • When we looked at the hot threads, UsageTrackingQueryCachingPolicy was coming up a lot, so we did as suggested in this issue: no change.
  • Maybe it's the number of replicas, seeing that Elasticsearch uses those for searches as well? We upped it to 3 and used more EC2 instances to spread them out: no change.
  • To determine whether we could actually use all the resources on the EC2 instance, we ran stress tests and everything seemed fine, getting the load to over 40. IO, memory, etc. also showed no issues under high strain.
  • It could still be the batch size. Under load we tried a batch of just one percolator in a multi percolate request, directly on the data & client node (dedicated to this index), and it took 1m50s. A batch of 200 percolators (still in one multi percolate request) took 2m02s, so the extra ~12s for the other 199 percolators fits roughly with the 15s result from earlier, without pressure.
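For reference, the processors and thread pool tweaks from the list above boiled down to node settings along these lines (the values are illustrative, not recommendations); the replica bump was an ordinary PUT of index.number_of_replicas: 3 to the index settings, like the allocation call in the first sketch.

    # elasticsearch.yml on the dedicated node (example values only)
    processors: 48                        # claim more cores than the 32 ES detects,
                                          # which widens the processor-sized default pools

    # ES 2.x thread pool settings use the threadpool.* prefix
    threadpool.percolate.size: 64
    threadpool.percolate.queue_size: 2000
    threadpool.get.size: 64
    threadpool.get.queue_size: 2000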

That last bullet might be interesting! It seems the request is stuck somewhere for a loooong time and then goes through the percolate phase itself quite smoothly.

Can anyone make anything out of this? Anything we have missed? We can provide more data if needed.

Have a look at the thread on the Elastic Discuss forum to see the solution.

TL;DR: Use multiple nodes on one big server to get better resource utilization.
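The gist of the solution: a single node/JVM could not drive the whole c3.8xlarge, so running several smaller Elasticsearch nodes side by side on the same machine (each joining the same cluster) lets percolation use the spare cores. A minimal sketch of what a second node's config can look like, with illustrative names, ports and paths:

    # elasticsearch.yml for a second node on the same machine (illustrative values)
    cluster.name: our-cluster
    node.name: percolate-node-2
    path.data: /data/es-node-2       # give each local node its own data path
    http.port: 9201                  # the first node keeps 9200
    transport.tcp.port: 9301         # the first node keeps 9300
    processors: 16                   # split the 32 cores between the local nodes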
