
Java - Logging best practices in multi-node environment

In my company I manage a large application (>100k users) on a multi-node infrastructure made of 3 (but potentially more) application servers. Each application server has 5 different log files, which capture almost all information regarding HTTP/S (REST or SOAP) requests and responses towards other (external) subsystems. I use Apache HttpClient to handle REST flows, wsimport-generated clients for SOAP requests, and Logback as the logging technology.

At the moment, when I'm asked to debug something, the most complex and time-consuming task is identifying the node on which I must debug. After that, I have to grep through tons of lines to discover what happened. To be honest, I find this very boring and antiquated, as well as complex.

To make my life easier and my logs more useful, over the past few days I had a look at the Elastic stack (Elasticsearch, Logstash, Kibana) and played with their Docker images. I find them very interesting and I would like to introduce them into my application, but before doing so I would like to know whether there are best practices / patterns for doing something similar.

These are my doubts:

  • Is there a best practice for logging HTTP/S REST and SOAP requests/responses (I need to see everything: URL, headers, path, body, cookies, ...) in a format that can be easily parsed by Logstash/Elasticsearch?
  • Considering my infrastructure, should I use an Elasticsearch appender in my Logback configuration, or use Logstash as a log processor (I suppose one for each application server)?
  • Are there valid alternatives to the Logback and Elasticsearch technologies that would fulfill my requirement?

I don't expect a simple and easy answer. I would like to read about different experiences in order to make the choice that best fits my situation.

Thanks!

Here is one of those complicated answers :) The ELK stack is definitely something that can make your life much easier in distributed environments. However, in order to benefit from it, consider the following practices:

  • In every log message that reaches Elasticsearch you should see the following (besides the obvious time, level and the message itself):

    • The server that produced the log message
    • The user that originated the request
    • If your application is multi-tenant: the tenant under which the request has been processed
  • All messages should have the same structure (layout)

  • Since you use Java, exceptions can become a potential issue (they span multiple lines), so they need special treatment. Logstash can deal with that, though.

  • If your flow can spread over different servers (you say you have 3 and potentially more), consider generating a special correlation ID per request: some random value that identifies a single flow across nodes. A minimal sketch follows below.
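As a minimal sketch of the correlation-ID idea, assuming a servlet-based stack and SLF4J's MDC (the header name X-Correlation-Id and the MDC key are illustrative choices, not part of the original answer):

    import java.io.IOException;
    import java.util.UUID;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    import org.slf4j.MDC;

    // Puts a per-request correlation id into the logging context, so every
    // log line written while handling the request carries the same id.
    public class CorrelationIdFilter implements Filter {

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            // Reuse an id propagated by an upstream node, or mint a new one.
            String id = ((HttpServletRequest) req).getHeader("X-Correlation-Id");
            if (id == null || id.isEmpty()) {
                id = UUID.randomUUID().toString();
            }
            MDC.put("correlationId", id);
            try {
                chain.doFilter(req, res);
            } finally {
                MDC.remove("correlationId"); // server threads are pooled: always clean up
            }
        }

        @Override
        public void init(FilterConfig filterConfig) { }

        @Override
        public void destroy() { }
    }

The id can then be referenced as %X{correlationId} in the Logback pattern, and forwarded as a header on outgoing REST/SOAP calls so the flow stays traceable across all the nodes.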

All of this helps you apply filters and get even more benefit from Elasticsearch.

Consider using TTLs for logs. You probably won't need to keep logs created more than a week or two ago.

Now regarding HTTP requests. Logging simply everything can be a security issue, because you can't be sure that some "sensitive" information won't be logged. So you'll want to keep this protected (at least your security guy will want you to :) ). Logging the URL, server, HTTP method and some user identifier (or tenant, if needed) will probably be sufficient, but that's solely my opinion. A sketch of such selective logging follows below.
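Since the question mentions Apache HttpClient, here is a minimal sketch of selective request logging with an HttpClient 4.x interceptor (the class name is made up; it deliberately logs only the method and URI, not headers, cookies or body):

    import org.apache.http.HttpRequest;
    import org.apache.http.HttpRequestInterceptor;
    import org.apache.http.protocol.HttpContext;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Logs only non-sensitive request attributes (method and URI),
    // deliberately skipping headers, cookies and body.
    public class SafeRequestLogger implements HttpRequestInterceptor {

        private static final Logger log = LoggerFactory.getLogger(SafeRequestLogger.class);

        @Override
        public void process(HttpRequest request, HttpContext context) {
            log.info("outgoing method={} uri={}",
                    request.getRequestLine().getMethod(),
                    request.getRequestLine().getUri());
        }
    }

It would be registered on the client with HttpClients.custom().addInterceptorFirst(new SafeRequestLogger()).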

Now regarding the appender vs. logstash (files) approach. Both approaches have pros and cons. For instance: if you use the logstash approach you'll have to parse every single line of your log files. If your application produces a lot of logs, this can affect performance, since parsing can be CPU-costly (especially if you use the grok filter in logstash). On the other hand, appenders let you avoid parsing altogether (you already have all the information in memory in Java).

On the other hand, appenders should be set up carefully. I don't have experience with a Logback Elasticsearch appender, but I think it should be at least:

  • asynchronous (otherwise your business flow can get stuck)
  • able to cope with failures (you won't want to throw an exception just because ES is currently unavailable; an end user shouldn't feel any of this)
  • probably maintaining some queues / using a disruptor under the hood, simply because you can produce far more logs than your appender may be able to send to ES, so eventually log messages will be lost. For example, if you have a queue of size 1000 and there are more than 1000 messages in the log, you can imagine what FIFO will do. A sketch of an asynchronous, bounded setup follows below.
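For the asynchronous / bounded-queue part, Logback's own AsyncAppender already covers a lot of this. A rough sketch of a programmatic setup (normally the same is declared in logback.xml; the Elasticsearch-bound appender passed in stands for whatever concrete implementation you pick):

    import ch.qos.logback.classic.AsyncAppender;
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.Appender;
    import org.slf4j.LoggerFactory;

    public class AsyncLoggingSetup {

        // Wraps an ES-bound appender so logging never blocks the business flow;
        // when the bounded queue is full, events are dropped rather than queued forever.
        public static void wrap(Appender<ILoggingEvent> esAppender) {
            LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

            AsyncAppender async = new AsyncAppender();
            async.setContext(ctx);
            async.setQueueSize(8192);        // bounded buffer between app threads and ES
            async.setDiscardingThreshold(0); // 0 = don't drop INFO and below early
            async.setNeverBlock(true);       // a full queue drops events instead of blocking
            async.addAppender(esAppender);
            async.start();

            Logger root = ctx.getLogger(Logger.ROOT_LOGGER_NAME);
            root.addAppender(async);
        }
    }

Note that neverBlock trades completeness for latency: under load you lose log lines, not requests.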

Yet another thing to consider: let's imagine that for some reason there is an issue with one of the application servers, so you'll probably want to restart it (gracefully or not). If you use an in-memory appender, what happens to those buffered messages? Wouldn't you want to see them in Elasticsearch for the post-mortem analysis? Bottom line: the in-memory approach can't deal with restarts.

On the other hand, whatever has been written to a file will be happily processed by the logstash process.

As for alternatives to the appender vs. logstash choice, you may consider using Apache Flume as a transport. If you go with an appender approach, you can use an embedded Flume agent and write a very good appender on top of it. Flume provides disk-based persistence, a transaction-like API, and so forth.

Having said that, as far as I know many people just go with the logstash approach.

One more thing, probably the last one that comes to mind:

  • You shouldn't really write directly to Elasticsearch. Instead, use some intermediate server (with logstash it can be Redis or RabbitMQ; in the Flume approach you can use just yet another Flume process, with scale-out support out of the box).

This allows you to abstract Elasticsearch away architecture-wise and apply additional processing on the logstash server (it can pull data from Redis / receive messages from RabbitMQ). Similar behaviour is achievable in Flume as well.

Hope this helps.

Is there a best practice […] format [/protocol]?

I am not aware of any logging standard that already has the fields you want, so you'll need a format that lets you store custom metadata. You can add metadata to syslog messages using the RFC 5424 format. I have also seen various log services accept JSON-formatted messages over a socket connection.

Should I use an elasticsearch appender?

I recommend sending directly to Logstash rather than directly to Elasticsearch:

  1. Logstash is designed to receive and parse messages in a variety of formats, so it will be easier to send messages in a format Logstash understands than in a format Elasticsearch understands.
  2. As your logging requirements evolve, you will be able to make the change in one place (Logstash) instead of reconfiguring every application instance.

    • This includes operational changes, such as changing the address of your Elasticsearch cluster.
  3. Logstash can do things such as censor logs (remove things that look like passwords or addresses).

  4. Logstash can send logs to a variety of downstream services. For example, it can trigger PagerDuty notifications or Slack messages if you encounter an important error.
  5. Logstash can enrich log messages with additional metadata (e.g. derive geo-coordinates from IP addresses). A sketch of shipping JSON to Logstash follows below.
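One common way to ship messages in a format Logstash understands is the logstash-logback-encoder library, which sends each event as a JSON document over TCP. A minimal sketch (the host and port are hypothetical, and the same setup is usually declared in logback.xml rather than in code):

    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import net.logstash.logback.appender.LogstashTcpSocketAppender;
    import net.logstash.logback.encoder.LogstashEncoder;
    import org.slf4j.LoggerFactory;

    public class LogstashShipping {

        // Ships log events as JSON over TCP to a Logstash "tcp" input
        // (configured with a json codec on the Logstash side).
        public static void configure() {
            LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

            LogstashEncoder encoder = new LogstashEncoder(); // one JSON object per event
            encoder.setContext(ctx);
            encoder.start();

            LogstashTcpSocketAppender appender = new LogstashTcpSocketAppender();
            appender.setContext(ctx);
            appender.addDestination("logstash.internal:5000"); // hypothetical host:port
            appender.setEncoder(encoder);
            appender.start();

            ctx.getLogger(Logger.ROOT_LOGGER_NAME).addAppender(appender);
        }
    }

LogstashEncoder includes MDC values as JSON fields by default, so context such as a correlation ID arrives in Elasticsearch as a queryable field.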

There are likely scale concerns as well. I am not knowledgeable enough to comment on those, but here's my gut feeling: I expect that Logstash is designed to handle a large number of connections well (and to handle connection failures gracefully). I don't know whether this is a similar priority in the design of an Elasticsearch cluster, or whether Elasticsearch's search performance would be impacted by having a large number of agents connected to it at once. I am more confident that Logstash is designed with this kind of use in mind.

You may also find that an Elasticsearch appender has limitations. The appender needs to have good support for a number of things. The first that come to mind are:

  • choice of protocol and encryption
  • choice of compression
  • full control over the format of the log message (including custom fields)
  • control over how special messages such as exceptions are sent

You can avoid the limitations of a technology-specific appender by sticking to a well-supported standard (e.g. the syslog appender).

Are there valid alternatives to logback and elasticsearch technologies to fulfill my requirement?

Do you mean to say logstash (i.e. "is there an alternative to the ELK stack?")? If that's your intention, then I don't have an answer.

But in terms of alternatives to Logback: I use Log4j 2. It provides async logging to reduce the performance burden on your application; maybe Logback has this feature too. Sending custom fields in Log4j 2 log messages is hard (currently there is poor support for escaping JSON; plugins are available, but your build needs to be set up correctly to support them). The easiest route for me was to use the RFC 5424 syslog appender.

Consider designing your Java application to invoke a logging facade (i.e. SLF4J) rather than invoking Logback directly. This lets you trivially switch to a different logging provider in the future; see the sketch below.
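A tiny illustration of coding against the facade: only the SLF4J API appears in application code, and the actual backend (Logback, Log4j 2, ...) is chosen at deployment time via the classpath. The class and method names here are made up:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class OrderService {

        // SLF4J API only; no Logback or Log4j 2 classes are referenced.
        private static final Logger log = LoggerFactory.getLogger(OrderService.class);

        public void placeOrder(String orderId) {
            log.info("Placing order {}", orderId); // parameterized: no string concatenation
        }
    }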

I have the same problem as you, and I decided to avoid any intermediate log gatherer (like Logstash/Flume).

https://github.com/internetitem/logback-elasticsearch-appender is not ideal at present, but its configuration is more flexible than that of https://github.com/logstash/logstash-logback-encoder

For example, logstash-logback-encoder uses fixed names for the standard fields of https://logback.qos.ch/apidocs/ch/qos/logback/classic/spi/ILoggingEvent.html

logback-elasticsearch-appender currently lacks persistence to local FS storage when its ring buffer is full, and lacks iteration over the available ES servers (only one can be specified).

Please note that Logstash is not fail-safe by default; from https://www.elastic.co/guide/en/logstash/current/persistent-queues.html :

By default, Logstash uses in-memory bounded queues between pipeline stages (inputs → pipeline workers) to buffer events. The size of these in-memory queues is fixed and not configurable.

So you need to invent some scheme with Redis, RabbitMQ, or Kafka. In my view an ES cluster is much safer than Logstash (ES resilience is a selling point in Elastic's own advertising).

Also note that Logstash is implemented in Ruby and is thus a single-threaded app! We can't talk about scalability here. Expect up to 10,000 req/s (a typical number from the performance reports I found on the Internet).

Flume has better performance, but I found that it lacks documentation. Get ready to ask questions on the mailing lists ))

There are a lot of commercial offerings:

  • Splunk <http://www.splunk.com/en_us/products/splunk-light.html>
  • Scalyr <https://www.scalyr.com/pricing>
  • Graylog <https://www.graylog.org/support-packages/>
  • Loggly <https://www.loggly.com/product/>
  • Motadata <https://www.motadata.com/elk-stack-alternative/>

They cost thousands of dollars per year, for good reasons.

You can see how hard it is to design a good appender from one of the log-gathering vendors: https://logz.io/blog/lessons-learned-writing-new-logback-appender/

With a centralized logging solution you should change the way you log:

  • Add context to https://www.slf4j.org/api/org/slf4j/MDC.html . That can be the client phone number, IP address, ticket number, or whatever else you have. You need a way to quickly filter the important data (see the sketch after this list).

  • Start using https://www.slf4j.org/api/org/slf4j/Marker.html for unexpected incidents that require immediate reaction. Don't hide or ignore problems!

  • Plan how to name MDC parameters and Markers, and document them, so the operations team knows what happened without calling you at midnight.

  • Set up replication in the ES cluster. That allows you to shut down part of the ES nodes for maintenance.
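A minimal sketch of the MDC and Marker advice together (the key names "ticketNumber"/"clientIp" and the marker name "FRAUD_ALERT" are illustrative; the point is that they are agreed upon and documented):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;
    import org.slf4j.Marker;
    import org.slf4j.MarkerFactory;

    public class PaymentHandler {

        private static final Logger log = LoggerFactory.getLogger(PaymentHandler.class);
        // Documented marker for incidents that require an immediate reaction.
        private static final Marker FRAUD = MarkerFactory.getMarker("FRAUD_ALERT");

        public void handle(String ticketNumber, String clientIp) {
            MDC.put("ticketNumber", ticketNumber); // context attached to every log line below
            MDC.put("clientIp", clientIp);
            try {
                log.info("Processing payment");
                // ... business logic ...
                log.warn(FRAUD, "Payment rejected by fraud rules"); // easy to filter/alert on
            } finally {
                MDC.clear(); // threads are pooled: don't leak context to the next request
            }
        }
    }

The operations team can then filter in Kibana on ticketNumber and alert on the FRAUD_ALERT marker instead of grepping raw text.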
