
A network fault-tolerant architecture for the ELK Stack

I've only been acquainted with the ELK Stack for a few days. We're trying to use it in our enterprise applications, but we have some architectural concerns. I've seen and read several ELK use cases and their architectures, especially LinkedIn's, but none of them discuss the potential effect of network errors on the architecture.

In traditional applications, where logs are usually written to local files, the only failure that can break logging is a "disk full" error, which is quite rare. But in a centralized logging system, where logs are sent over the network, network errors are common, so I think the system is highly crash-prone, especially in organizations with unreliable networks.

Furthermore, as I've seen in many ELK use cases, a single instance of a JMS provider, or in other words a pub/sub broker such as Kafka or Redis, is used alongside the ELK Stack. On top of the previous problem, I think that broker is a single point of failure in these architectures, unless it is clustered.

I think we can get rid of both problems if we run a broker like Kafka alongside the shipper(s) on each node, as follows (one Kafka instance per node):

((log-generator)+ (logstash)? Kafka)* -> Logstash -> Elasticsearch -> Kibana
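
To make this concrete, here is a minimal sketch of what the two Logstash pipeline configurations could look like with a reasonably recent Kafka plugin; the hostnames, ports, paths, and the "logs" topic are placeholders, not from a real deployment:

# On each application node: a lightweight Logstash shipper
# reads local log files and produces to the node-local Kafka.
input {
  file {
    path => "/var/log/app/*.log"   # placeholder path
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"   # node-local broker
    topic_id => "logs"
  }
}

# On the central side: an indexing Logstash consumes from the
# per-node brokers and writes to Elasticsearch.
input {
  kafka {
    bootstrap_servers => "node1:9092,node2:9092"   # placeholder node list
    topics => ["logs"]
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
  }
}

With this layout, a network outage between a node and the central side only causes events to accumulate in that node's local Kafka, to be drained once connectivity returns.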

Please let me know whether this architecture makes sense.
If it doesn't, any other fault-tolerant architecture would be welcome :)

The answer depends on how much risk is allowed, where you might expect to encounter such risk, and how long you expect an incident to last.

If you write to local files, you can use Filebeat to ship the files to a remote logstash. If that logstash (or the downstream Elasticsearch cluster) applies back-pressure, filebeat will slow down or stop sending logs. This provides you with a distributed cache on the remote machines (no broker required). The downside is that, if the outage is long-lasting, the log file might be rotated out from under filebeat's glob pattern, and then it will never ship.
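
As a rough illustration, a minimal filebeat.yml for this setup could look like the following (current Filebeat syntax; the path and hostname are placeholders), with the receiving Logstash exposing a plain beats input:

# filebeat.yml -- tail local files and ship them to a remote Logstash
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log   # the glob pattern mentioned above

output.logstash:
  hosts: ["logstash.example.com:5044"]

# Matching Logstash pipeline: accept the beats protocol
input {
  beats {
    port => 5044
  }
}

When Logstash applies back-pressure, Filebeat simply stops advancing its file offsets, which is what makes the local files act as the buffer.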

With multiple logstash instances, you can configure filebeat to ship to a list of them, thus providing some survivability. If you have "one-time" events (like snmptraps, syslog, etc), you'll want to think about the possible outages a little more.
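
For example, the Filebeat output above could be changed to (hostnames again placeholders):

output.logstash:
  hosts: ["logstash1.example.com:5044", "logstash2.example.com:5044"]
  loadbalance: true   # spread events across reachable hosts; omit for failover-only

Without loadbalance, Filebeat picks one host and only moves on if it becomes unreachable.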

I used to run a separate logstash instance for these types of events, which would feed into redis. The main logstash (when up) would then read from the queue and process the events. This allowed me to launch a new logstash config without fear of losing events. These days, I try to write events to files (with snmptrapd, etc), and not depend on any logstash running 24x7x365.
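
A minimal sketch of that buffered setup, using the stock redis input and output plugins (the key name and ports are illustrative):

# Edge Logstash: catch one-time events and park them in Redis
input {
  syslog {
    port => 5514
  }
}
output {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
  }
}

# Main Logstash: drain the Redis list whenever it is running
input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
  }
}

Because Redis holds the list durably (subject to its own persistence settings), the main Logstash can be restarted with a new config without dropping the one-time events.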
