
ElasticSearch / Logstash / Kibana: How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?

We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.

We have an autoscaling group of Logstash instances behind a load balancer, which logs to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.

For day-to-day business we run 2 Logstash instances and 2 ElasticSearch instances.

Our website experiences short periods of very high traffic during events; our traffic increases by about 2000% during these events. We know about these events well in advance.

Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we subsequently scaled down too quickly, losing shards and corrupting our indexes.

I've been thinking of setting auto_expand_replicas to "1-all" to ensure each node has a copy of all the data, so we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data, which works out to around 50 GB in all.
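For reference, auto_expand_replicas is a dynamic, per-index setting, so it could be applied to our existing indices without a restart. A minimal sketch using the standard index settings API (the host and index pattern are placeholders, not our real names):

curl -XPUT 'http://elasticsearch-lb:9200/logstash-*/_settings' -d '
{
  "index": {
    "auto_expand_replicas": "1-all"
  }
}'

With "1-all", each node we add would pull a full copy of every index, so roughly the whole ~50 GB would be copied to each new node as it joins.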

I've also seen people mention using a separate autoscaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the events I previously mentioned?
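As I understand it, such non-data (client) nodes are just ordinary Elasticsearch nodes started with the data and master roles disabled, roughly like this in elasticsearch.yml (1.x-style settings; newer versions express this through node roles instead):

node.master: false
node.data: false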

My Advice

Your best bet is using Redis as a broker between Logstash and Elasticsearch:

[Architecture diagram of the proposed solution]

This is described in some old Logstash docs but is still pretty relevant.

Yes, you will see a delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, as the latency between Redis and Logstash is relatively small. In my experience, Logstash tends to work through the backlog on Redis pretty quickly.

This also gives you a more robust setup: even if Logstash goes down, you're still accepting events through Redis.
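A rough sketch of what the two Logstash roles might look like, using the standard redis input and output plugins. The hostnames, file paths and Redis key are placeholders, and the exact elasticsearch output options depend on your Logstash version (older releases used host rather than hosts):

# shipper.conf - runs near the application and pushes raw events onto a Redis list
input {
  file {
    path => "/var/log/myapp/*.log"
  }
}
output {
  redis {
    host      => "redis-broker"
    data_type => "list"
    key       => "logstash"
  }
}

# indexer.conf - the autoscaled Logstash group pops events off Redis and indexes them
input {
  redis {
    host      => "redis-broker"
    data_type => "list"
    key       => "logstash"
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch-lb:9200"]
  }
}

Redis is only acting as a simple queue here: it absorbs the burst during your events, and the indexer group drains the list at whatever rate Elasticsearch can sustain.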

Just scaling Elasticsearch

As to your question about whether extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes and then aggregate the results before sending them back to the client. They take the load of aggregating results off the data nodes.

Writes will always involve your data nodes.

I don't think adding and removing nodes is a great way to cater for this.

You can try to tweak the thread pools and queues during your peak periods. Let's say you normally have the following:

threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 30
    queue_size: 1000

So you have an equal number of search and index threads available. Just before your peak time, you can change the settings (at runtime) to the following:

threadpool:
  index:
    type: fixed
    size: 50
    queue_size: 2000
  search:
    type: fixed
    size: 10
    queue_size: 500

Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the index queue_size to allow more of a backlog to build up. This might not work as expected, though, and experimentation and tweaking are recommended.
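A hedged sketch of how that change could be applied at runtime: on the Elasticsearch 1.x releases this answer was written against, thread pool settings were dynamically updatable through the cluster settings API (later versions made them static node settings, so there you would edit elasticsearch.yml and restart instead). The host is a placeholder:

curl -XPUT 'http://elasticsearch-lb:9200/_cluster/settings' -d '
{
  "transient": {
    "threadpool.index.size": 50,
    "threadpool.index.queue_size": 2000,
    "threadpool.search.size": 10,
    "threadpool.search.queue_size": 500
  }
}'

Using "transient" means the override disappears after a full cluster restart, which is convenient since you only want it for the duration of the event; "persistent" would keep it.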
