
A network fault-tolerant architecture for ELK Stack

I've only been acquainted with the ELK Stack for a few days. We're trying to use it in our enterprise applications but have some architectural concerns. I've seen and read some use cases of ELK and their architectures, especially at LinkedIn, but no one has discussed the potential effect of network errors on the architecture.

In traditional applications, where logs are usually written to files, the only thing that can crash the system is a "Disk is Full" error, which is really rare. But in a centralized log system, where logs are sent over the network, network errors are very common, so I think the system is highly crash-prone, especially in organizations with unreliable networks.

Furthermore, as I've seen in many ELK use cases, a single instance of a JMS provider, or in other words a pub/sub broker like Kafka or Redis, is used alongside ELK. I think that, in addition to the previous problem, this broker is a single point of failure in these architectures, unless it is clustered.

I think we can get rid of both problems if we run a broker like Kafka alongside the shipper(s) on each node, as follows (one Kafka per node):

((log-generator)+ (logstash)? Kafka)* -> Logstash -> Elasticsearch -> Kibana
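
For illustration only, a minimal sketch of the central Logstash stage in that pipeline, assuming each node's shipper publishes JSON events to a local Kafka topic. The broker addresses, topic name, group id, and index pattern below are placeholders, not part of the question, and the option names assume a recent version of the Logstash Kafka input plugin:

    # Central Logstash indexer (sketch): consume from the per-node Kafka
    # brokers and index into Elasticsearch. All names are assumptions.
    input {
      kafka {
        bootstrap_servers => "node1:9092,node2:9092,node3:9092"
        topics            => ["app-logs"]
        group_id          => "central-indexer"
        codec             => "json"
      }
    }

    output {
      elasticsearch {
        hosts => ["http://elasticsearch:9200"]
        index => "app-logs-%{+YYYY.MM.dd}"
      }
    }

The idea is that each application (or local Logstash shipper) only ever talks to a broker on the same machine, so a network outage toward the central indexer does not block logging; events accumulate in the local Kafka until the indexer catches up.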

Please let me know whether this architecture makes sense. If it doesn't, any other fault-tolerant architecture is welcome :)

The answer depends on how much risk is acceptable, where you might expect to encounter that risk, and how long you expect an incident to last.

If you write to local files, you can use Filebeat to ship them to a remote Logstash. If that Logstash (or the downstream Elasticsearch cluster) applies back-pressure, Filebeat will slow down or stop sending logs. This gives you a distributed cache on the remote machines (no broker required). The downside is that, if the outage is long-lasting, the log file might be rotated out from under Filebeat's glob pattern, and then it will never be shipped.
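
A minimal Filebeat configuration sketch for that setup, assuming a recent Filebeat release; the log path and Logstash hostname are placeholders:

    # filebeat.yml (sketch): tail local log files and ship them to a
    # remote Logstash. Path and hostname are assumptions.
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/myapp/*.log

    output.logstash:
      hosts: ["logstash.example.com:5044"]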

With multiple Logstash instances, you can configure Filebeat to ship to a list of them, which provides some survivability. If you have "one-time" events (like snmptraps, syslog, etc.), you'll want to think about possible outages a little more.
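
A hedged sketch of the multi-Logstash variant; the host names are placeholders, and `loadbalance: true` is the Filebeat option that distributes batches across the listed hosts instead of sticking to a single one:

    # filebeat.yml (sketch): ship to several Logstash instances so a
    # single Logstash outage does not stop delivery. Hosts are assumptions.
    output.logstash:
      hosts: ["logstash1.example.com:5044", "logstash2.example.com:5044"]
      loadbalance: true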

I used to run a separate Logstash instance for these types of events, which would feed into Redis. The main Logstash (when up) would then read from the queue and process the events. This allowed me to launch a new Logstash config without fear of losing events. These days, I try to write events to files (with snmptrapd, etc.) and not depend on any Logstash running 24x7x365.
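
One way that Redis buffering pattern could be wired, as a sketch rather than the exact configs from the answer; the listener port, Redis host, and list key are placeholders:

    # Edge Logstash (kept running): receive one-time events and push
    # them onto a Redis list. Port, host, and key are assumptions.
    input {
      udp {
        port => 5514
        type => "syslog"
      }
    }
    output {
      redis {
        host      => "redis.example.com"
        data_type => "list"
        key       => "logstash"
      }
    }

    # Main Logstash (safe to restart): drain the Redis list, apply any
    # filters, and index into Elasticsearch.
    input {
      redis {
        host      => "redis.example.com"
        data_type => "list"
        key       => "logstash"
      }
    }
    output {
      elasticsearch {
        hosts => ["http://elasticsearch:9200"]
      }
    }

While the main Logstash is down or being reconfigured, events simply accumulate in the Redis list and are drained once it comes back.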
