
Kubernetes pods graceful shutdown with TCP connections (Spring Boot)

Asked 2020-02-18 08:11:58 · Tags: spring-boot, kubernetes, azure-aks, horizontal-pod-autoscaling

I am hosting my services on Azure. Sometimes I get a "BackendConnectionFailure" for no apparent reason. After investigating, I found a correlation between this exception and autoscaling (scaling down): in most cases the two happen within the same second.

According to the documentation, the termination grace period is 30 seconds by default, which is what I have. The pod is marked as terminating and the load balancer no longer considers it, so it receives no more requests. Given that, if my service takes far less than 30 seconds to respond, I should not need a preStop hook or any special implementation in my application (please correct me if I am wrong).

If the previous paragraph is correct, why does this exception occur relatively frequently? My guess is that it happens around the moment the pod is marked as terminating and the load balancer is supposed to stop forwarding requests to it.

Edit 1:

The architecture is simply this:

Client -> Firewall (Azure) -> API (Azure APIM) -> Microservices (Spring Boot) -> backend (third party) or Azure RDB, depending on the service

I think the exception comes from APIM. I found two patterns for it:

Message: The underlying connection was closed: The connection was closed unexpectedly.
Exception type: BackendConnectionFailure
Failed method: forward-request
Response time: 10.0 s

Message: The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
Exception type: BackendConnectionFailure
Failed method: forward-request
Response time: 3.6 ms

3 Answers

Spring Boot doesn't do graceful termination by default.

The Spring Boot app and its application container (not the Linux container) are in control of what happens to existing connections during the termination grace period. The protocols being used and how a client reacts to a "close" also have a part to play.

If you get to the end of the grace period, then everything gets a hard reset.

Kubernetes

When a pod is deleted in k8s, the removal of the Pod Endpoint from Services is triggered at the same time as the SIGTERM signal is sent to the container(s).

At this point the cluster nodes will be reconfigured to remove any rules directing new traffic to the Pod. Any existing TCP connections to the Pod/containers will remain in connection tracking until they are closed (by the client, server or network stack).

For HTTP keep-alive or HTTP/2 services, the client will continue hitting the same Pod Endpoint until it is told to close the connection (or it is forcibly reset).

App

The basic rules are that, on SIGTERM, the application should:

Allow running transactions to complete
Do any application cleanup required
Stop accepting new connections, just in case
Close any inactive connections it can (keep-alive requests, websockets)
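
As a rough illustration of these rules, here is a minimal sketch that uses the JDK's built-in com.sun.net.httpserver.HttpServer rather than Spring Boot's embedded container. The class name DrainingServer, the /work path, port 8080 and the 20-second drain budget are all assumptions made for the example. HttpServer.stop(delay) closes the listening socket, waits up to the given number of seconds for in-flight exchanges to finish, and then closes the remaining connections.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo class: a small HTTP server that drains on SIGTERM. */
final class DrainingServer {
    private final HttpServer server;
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    DrainingServer(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        // Hypothetical endpoint standing in for real application work.
        server.createContext("/work", exchange -> {
            byte[] body = "done".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.setExecutor(workers);
        server.start();
    }

    int port() {
        return server.getAddress().getPort();
    }

    /**
     * Stops accepting new connections, blocks for up to drainSeconds so that
     * running exchanges can complete, then closes all open connections.
     */
    void shutdown(int drainSeconds) {
        server.stop(drainSeconds);
        workers.shutdown(); // application cleanup: release the worker threads
    }

    public static void main(String[] args) throws IOException {
        DrainingServer app = new DrainingServer(8080);
        // The JVM runs shutdown hooks when it receives SIGTERM.
        // 20 s is an assumed budget, kept below the 30 s default grace period.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> app.shutdown(20)));
    }
}
```

The drain budget needs to stay shorter than the pod's termination grace period, otherwise the container is killed before the drain finishes.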

There are some circumstances you might not be able to handle (it depends on the client):

A keep-alive connection that doesn't complete a request within the grace period can't be sent a Connection: close header. It will need a TCP-level FIN close.
A slow client with a long transfer: in a one-way HTTP transfer, these will have to be waited for or forcibly closed.

Although keep-alive clients should respect a TCP FIN close, every client reacts differently. Microsoft APIM might be sensitive and report the error even though there was no real-world impact. It's best to load test your setup while scaling to see whether there is a real-world impact.

For more Spring Boot info, see:

https://github.com/spring-projects/spring-boot/issues/4657
https://github.com/corentin59/spring-boot-graceful-shutdown
https://github.com/SchweizerischeBundesbahnen/springboot-graceful-shutdown

You can use a preStop sleep if needed. While the pod is removed from the service endpoints immediately, it still takes time (10-100 ms) for the endpoint update to be sent to every node and for each node to update iptables.

When your application receives a SIGTERM (from the Pod termination), it first needs to stop reporting that it is ready (fail the readinessProbe) while still serving requests as they come in from clients. After a certain time (depending on your readinessProbe settings), you can shut down the application.

For Spring Boot there is a small library that does exactly that: springboot-graceful-shutdown
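
The sequence described above (fail readiness first, keep serving, then stop) can be sketched without any framework. This is not how the library above is implemented; it is only an illustration built on the JDK's com.sun.net.httpserver.HttpServer. The class name ReadinessGate, the /ready path, port 8080 and the timing values are all assumptions.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical demo class: reports "not ready" before it stops serving. */
final class ReadinessGate {
    private final AtomicBoolean ready = new AtomicBoolean(true);
    private final HttpServer server;

    ReadinessGate(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        // Assumed probe path; the pod's readinessProbe would point here.
        server.createContext("/ready", ex -> reply(ex, ready.get() ? 200 : 503));
        // Regular traffic keeps being served while the pod is draining.
        server.createContext("/", ex -> reply(ex, 200));
        server.start();
    }

    private static void reply(HttpExchange ex, int status) throws IOException {
        ex.sendResponseHeaders(status, -1); // -1 means no response body
        ex.close();
    }

    int port() {
        return server.getAddress().getPort();
    }

    /** Step 1: make the readiness probe fail, but keep serving requests. */
    void markNotReady() {
        ready.set(false);
    }

    /** Step 3: close the listener and wait up to drainSeconds for in-flight work. */
    void stop(int drainSeconds) {
        server.stop(drainSeconds);
    }

    /** Full sequence: fail readiness, wait for the probe to notice, then stop. */
    void shutdown(long probeWindowMillis, int drainSeconds) {
        markNotReady();
        try {
            Thread.sleep(probeWindowMillis); // step 2: let the probe fail
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        stop(drainSeconds);
    }

    public static void main(String[] args) throws IOException {
        ReadinessGate app = new ReadinessGate(8080);
        // Assumed timings: 10 s for the probe to fail, 15 s to drain, which
        // together stay under the 30 s default grace period.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> app.shutdown(10_000, 15)));
    }
}
```

The probe window should cover at least the probe's periodSeconds multiplied by its failureThreshold, so that Kubernetes has actually observed the failure before the listener closes.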