
Connection Reset By Peer - with driver 2.8.0 and mongo 4.0.9 on a k8s cluster

We have been getting "Connection Reset by Peer" mongo errors in our setup. A description of the setup:

  • mongo running as a replica set in a k8s cluster on EKS
  • clients (C#) running in the same k8s cluster on EKS
  • mongo 4.0.9
  • C# driver 2.8.0
  • Connection pooling ON
  • max idle time set to 10 min (overrode default of 10s)
  • max connection lifetime set to 10 min (overrode default of 10s); a code sketch of these settings follows the list
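
For reference, here is a minimal sketch of how those pool settings are applied with the 2.8.0 C# driver; the connection string, host names and replica set name are placeholders, not taken from the actual setup:

using System;
using MongoDB.Driver;

// Placeholder connection string: the hosts and replica set name are illustrative only.
var url = new MongoUrl("mongodb://mongo-0.mongo,mongo-1.mongo,mongo-2.mongo/?replicaSet=rs0");
var settings = MongoClientSettings.FromUrl(url);

// Pool settings as listed above: idle time and lifetime both set to 10 minutes.
settings.MaxConnectionIdleTime = TimeSpan.FromMinutes(10);
settings.MaxConnectionLifeTime = TimeSpan.FromMinutes(10);

var client = new MongoClient(settings);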

We get these errors. We observed that a series of calls, say 500 key-based selects, runs with no issue. If we then pause for 5 minutes and repeat the test, the first call gets a "Connection Reset by Peer"; after that, the test continues. This happens every time after a pause.

This condition is reproduced by real user behavior: there are spurts of activity followed by a lull, so we keep getting "Connection reset by peer" at critical points in the business workflow. On the client side, the workaround is to code defensively and retry the call, but that means changes in many places.
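
As an illustration of that defensive pattern, a minimal retry wrapper might look like the sketch below; the helper name and the retry policy are hypothetical, not part of our code:

using System;
using System.Threading.Tasks;
using MongoDB.Driver;

static class MongoRetry
{
    // Hypothetical helper: retries an operation once more when the driver reports
    // a connection error such as "Connection reset by peer". Only suitable for
    // idempotent operations like the key-based selects described above.
    public static async Task<T> RunWithRetryAsync<T>(Func<Task<T>> operation, int maxAttempts = 2)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (MongoConnectionException) when (attempt < maxAttempts)
            {
                // The next attempt checks out a different connection from the pool.
            }
        }
    }
}

A call site would then wrap each read, for example: MongoRetry.RunWithRetryAsync(() => collection.Find(filter).FirstOrDefaultAsync()).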

Other combinations attempted:

  • mongo 4.0.9
  • C# driver 2.8.0
  • Connection pooling ON
  • max idle time 120 min
  • max connection lifetime 60 min

However, there was no change in the behavior.

It appears to us that while the TCP connection is closed on the server side, the client still thinks it is a valid connection and attempts to use it, leading to this error.

Has anybody else faced such a situation? Any suggestions would be appreciated; happy to provide more information if needed.

I have a very similar issue with a cluster running on AKS. I managed to track this back to conntrack seeing (or thinking it is seeing) TCP retransmissions. Here is an example in which the client pod is 10.3.0.88 and the server pod is 10.3.0.113, looking at the conntrack entries on the node running mongo:

conntrack -L | grep "10\.3\.0\.113" | grep "10\.3\.0\.88"

conntrack v1.4.3 (conntrack-tools): 1091 flow entries have been shown.
tcp      6 86398 ESTABLISHED src=10.3.0.88 dst=10.3.0.113 sport=34919 dport=27017 src=10.3.0.113 dst=10.3.0.88 sport=27017 dport=34919 [ASSURED] mark=0 use=1
tcp      6 86398 ESTABLISHED src=10.3.0.88 dst=10.3.0.113 sport=33389 dport=27017 src=10.3.0.113 dst=10.3.0.88 sport=27017 dport=33389 [ASSURED] mark=0 use=1
tcp      6 86390 ESTABLISHED src=10.3.0.88 dst=10.3.0.113 sport=39917 dport=27017 src=10.3.0.113 dst=10.3.0.88 sport=27017 dport=39917 [ASSURED] mark=0 use=1
tcp      6 51 TIME_WAIT src=10.3.0.88 dst=10.3.0.113 sport=36649 dport=27017 src=10.3.0.113 dst=10.3.0.88 sport=27017 dport=36649 [ASSURED] mark=0 use=1
tcp      6 298 ESTABLISHED src=10.3.0.88 dst=10.3.0.113 sport=35033 dport=27017 src=10.3.0.113 dst=10.3.0.88 sport=27017 dport=35033 [ASSURED] mark=0 use=1
tcp      6 299 ESTABLISHED src=10.3.0.88 dst=10.3.0.113 sport=44131 dport=27017 src=10.3.0.113 dst=10.3.0.88 sport=27017 dport=44131 [ASSURED] mark=0 use=1

You can see that there are some entries with very low timeouts (298/299 seconds): these started at 86400 seconds (/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established) but have been moved to 300 seconds (nf_conntrack_tcp_timeout_max_retrans). I am reasonably sure this is the case because changing nf_conntrack_tcp_timeout_max_retrans changes the timeout value above.

At this stage I am not sure why the retransmissions are occurring, but it would be interesting to know whether your problem is the same.

In my case it can be worked around by increasing nf_conntrack_tcp_timeout_max_retrans to more than 10 minutes, or by decreasing the mongo idle connection timeout to less than 5 minutes.
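
For the driver-side half of that workaround, here is a sketch of keeping pooled connections' idle time below the 5-minute conntrack window, assuming the C# driver honors the standard maxIdleTimeMS connection-string option; host names are placeholders:

using MongoDB.Driver;

// Sketch: maxIdleTimeMS=240000 (4 minutes) keeps idle pooled connections below
// the ~300 second nf_conntrack_tcp_timeout_max_retrans window described above.
// Host names and replica set name are placeholders.
var client = new MongoClient(
    "mongodb://mongo-0.mongo,mongo-1.mongo/?replicaSet=rs0&maxIdleTimeMS=240000");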
