
Is it possible to track down very rare failed requests using linkerd?

Original post: 2019-12-12 14:09:25. Tags: kubernetes, linkerd

Linkerd's docs explain how to track down failing requests using the tap command, but in some cases the success rate is very high, with only a single failed request every hour or so. How can I track down the requests that are counted as "unsuccessful"? Is there perhaps a way to log them somewhere?

1 answer

It sounds like you're looking for a way to configure Linkerd to trap requests that fail and dump the request data somewhere. Linkerd does not support that at the moment.

You do have a couple of options with the current functionality to derive some of the info that you're looking for. The Linkerd proxies record error rates as Prometheus metrics, which Grafana consumes to render the dashboards. When you observe one of these infrequent errors, you can use the time window functionality in Grafana to find the precise time at which the error occurred, then check the service log for corresponding error messages. If the error is coming from the service itself, you can add as much logging about the request as you need to help solve the problem.
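As a rough sketch of the "find the precise time" step outside Grafana: the same metrics can be read from Prometheus directly. The helper below takes the `data.result` list of a Prometheus `/api/v1/query_range` response and returns the timestamps at which the failure count was non-zero. The query string assumes the proxies' `response_total` counter with a `classification="failure"` label; the `deployment="web"` selector and the function name are placeholders, not something taken from the question.

```python
from datetime import datetime, timezone

# Assumed PromQL for the range query; "web" is a placeholder deployment name.
FAILURE_QUERY = (
    'sum(increase(response_total{classification="failure", deployment="web"}[1m]))'
)


def failure_times(range_result):
    """Return sorted UTC datetimes of samples whose value is greater than zero.

    `range_result` is the `data.result` list of a Prometheus
    /api/v1/query_range response (resultType "matrix"), where each series
    carries a list of [unix_timestamp, "value"] pairs.
    """
    times = set()
    for series in range_result:
        for timestamp, value in series["values"]:
            if float(value) > 0:
                times.add(datetime.fromtimestamp(float(timestamp), tz=timezone.utc))
    return sorted(times)
```

The returned timestamps are only as precise as the `step` used for the range query, so a small step over the suspicious hour narrows the window you then have to search in the service log.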

Another option, which I haven't tried myself, is to integrate linkerd tap into your monitoring system to collect the request info and save the data for the requests that fail. One caveat: be careful about leaving a tap command running, because it continuously collects data from the tap control plane component, which adds load to that service.
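A minimal sketch of what such an integration could look like, with the same caveat about load. It assumes the plain-text tap output consists of `req` and `rsp` lines made of `key=value` fields, with `:status` on the response line and an `id` that pairs a response with its request; verify that against the output of your Linkerd version before relying on it. The function names are hypothetical.

```python
import re

# Assumed line shape (check against your own tap output):
#   req id=0:1 proxy=in src=... dst=... :method=GET :authority=web:80 :path=/api
#   rsp id=0:1 proxy=in src=... dst=... :status=503 latency=1200µs
FIELD = re.compile(r"(\S+?)=(\S+)")


def parse_line(line):
    """Split a tap line into its event kind ("req", "rsp", ...) and its fields."""
    kind, _, rest = line.strip().partition(" ")
    return kind, dict(FIELD.findall(rest))


def failed_requests(lines, min_status=500):
    """Yield (request_fields, response_fields) for responses with status >= min_status."""
    pending = {}  # requests seen so far that have no response yet
    for line in lines:
        kind, fields = parse_line(line)
        key = (fields.get("id"), fields.get("proxy"), fields.get("src"), fields.get("dst"))
        if kind == "req":
            pending[key] = fields
        elif kind == "rsp":
            request = pending.pop(key, {})
            status = fields.get(":status", "")
            if status.isdigit() and int(status) >= min_status:
                yield request, fields
```

To use it, pipe the output of something like `linkerd tap deploy/web` (the resource name is a placeholder) into a script that calls `failed_requests(sys.stdin)` and writes each pair to your log store. Requests that never receive a response stay in `pending`, so a long-running version would need to expire old entries.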

Perhaps a more straightforward approach would be to ensure that all the proxy logs and service logs are written to a long-term store like Splunk, an ELK stack (Elasticsearch, Logstash, and Kibana), or Loki. Then you can set up alerting (Prometheus Alertmanager, for example) to send a notification when a request fails, and match the time of the failure against the logs that have been collected.
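The matching step is simple enough to sketch. The helper below assumes each collected log line starts with an ISO 8601 timestamp followed by a space, and returns the lines that fall within a window around the alert time. The function name and the 30-second default are illustrative choices, and in practice you would run the equivalent time-range query in Splunk, Kibana, or Loki rather than in a script.

```python
from datetime import datetime, timedelta


def logs_near(failure_time, log_lines, window=timedelta(seconds=30)):
    """Return log lines whose leading ISO 8601 timestamp is within `window` of failure_time.

    Assumes `failure_time` and the log timestamps are either all
    timezone-aware or all naive; mixing the two raises TypeError.
    """
    matches = []
    for line in log_lines:
        stamp, _, _message = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a parseable timestamp
        if abs(when - failure_time) <= window:
            matches.append(line)
    return matches
```

A wider window is safer when the alert fires on an evaluation interval, since the notification time can lag the actual failure by up to that interval.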

You could also look into adding distributed tracing to your environment. Depending on the implementation that you use (Jaeger, Zipkin, etc.), I think the interface will allow you to inspect the details of the request for each trace.

One final thought: since Linkerd is an open source project, I'd suggest opening a feature request with specifics on the behavior that you'd like to see, and working with the community to get it implemented. I know the roadmap includes plans to be able to see the request bodies using linkerd tap, and this sounds like a good use case for having those bodies.