
ElasticSearch/Logstash/Kibana: How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?

We use a standard ELK (ElasticSearch/Logstash/Kibana) setup in AWS for our website's logging needs.

We have an autoscaling group of Logstash instances behind a load balancer, which send their output to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.

For day-to-day operation we run 2 Logstash instances and 2 ElasticSearch instances.

Our website experiences short periods of very high traffic during events - traffic increases by about 2000% at these times. We know about these events well in advance.

Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we subsequently scaled down too quickly, losing shards and corrupting our indexes.

I've been thinking of setting auto_expand_replicas to "1-all" so that each node has a copy of all the data, meaning we wouldn't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently keep only about 2 weeks of log data - around 50 GB in total.
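For reference, I'd expect to apply that per index through the index settings API, along these lines (the endpoint and index pattern here are illustrative, not our actual setup):

```shell
# Enable auto-expanding replicas on all logstash-* indexes,
# so every data node ends up holding a full copy of each shard.
curl -XPUT 'http://localhost:9200/logstash-*/_settings' -d '
{
  "index": { "auto_expand_replicas": "1-all" }
}'
```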

I've also seen people mention using a separate autoscaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the events I mentioned above?

My Advice

Your best bet is using Redis as a broker in between Logstash and Elasticsearch:

[Architecture diagram of the proposed solution]

This is described in some old Logstash docs but is still pretty relevant.

Yes, you will see a delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, since the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through the backlog on Redis pretty quickly.

This kind of setup is also more robust: even if Logstash goes down, you're still accepting the events into Redis.
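A minimal sketch of that broker pattern, split into a shipper and an indexer pipeline (the hostnames and the `logstash` key are placeholders, and exact option names vary between Logstash versions):

```
# shipper.conf - runs near the application, pushes events onto a Redis list
output {
  redis {
    host      => "redis.example.com"   # placeholder: your Redis broker
    data_type => "list"
    key       => "logstash"
  }
}

# indexer.conf - drains the Redis list and writes to Elasticsearch
input {
  redis {
    host      => "redis.example.com"
    data_type => "list"
    key       => "logstash"
  }
}
output {
  elasticsearch {
    host => "es-lb.example.com"        # placeholder: your ES load balancer
  }
}
```

Because Redis simply buffers the list, the indexer side can drain it at its own pace regardless of how fast events arrive during a spike.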

Just scaling Elasticsearch

As to your question about whether extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads), as they delegate the search to all the data nodes and then aggregate the results before sending them back to the client, taking that aggregation load off the data nodes.

Writes will always involve your data nodes.

I don't think adding and removing nodes is a great way to cater for this.

You can try tweaking the thread pools and queues during your peak periods. Say you normally have the following:

threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 30
    queue_size: 1000

So you have an even split of search and index threads available. Just before your peak time, you can change the settings at runtime to the following:

threadpool:
  index:
    type: fixed
    size: 50
    queue_size: 2000
  search:
    type: fixed
    size: 10
    queue_size: 500

Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the queue_size to allow more of a backlog to build up. This might not work exactly as expected, though, so experimentation and tweaking are recommended.
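On older (1.x-era) Elasticsearch versions, where thread pool settings are dynamically updatable, the switch can be made without a restart via the cluster settings API - a sketch, assuming a node reachable on localhost (newer versions treat thread pools as static, node-level settings requiring an elasticsearch.yml change and restart):

```shell
# Temporarily bias the cluster toward indexing for the duration of the event.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "threadpool.index.size": 50,
    "threadpool.index.queue_size": 2000,
    "threadpool.search.size": 10,
    "threadpool.search.queue_size": 500
  }
}'
```

Using "transient" (rather than "persistent") means the values reset on a full cluster restart, which suits a change you only want for the duration of the event.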
