Kubernetes (K3S) POD gets "ENOTFOUND" after 5-20 hours of airing time

I'm running my backend on Kubernetes, with around 250 pods under 15 deployments; the backend is written in NodeJS.

Sometimes, after X hours (5<X<30), I get ENOTFOUND in one of the pods, as follows:

{
  "name": "main",
  "hostname": "entrypoint-sdk-54c8788caa-aa3cj",
  "pid": 19,
  "level": 50,
  "error": {
    "errno": -3008,
    "code": "ENOTFOUND",
    "syscall": "getaddrinfo",
    "hostname": "employees-service"
  },
  "msg": "Failed calling getEmployee",
  "time": "2022-01-28T13:44:36.549Z",
  "v": 0
}

I'm running a stress test on the backend at YY users per second, but I keep this stress level steady and don't change it, and then the error happens out of nowhere with no specific reason.

Kubernetes is K3S, Server Version: v1.21.5+k3s2

Any idea what might cause this weird ENOTFOUND?

I already saw the same question on GitHub, with a reference to getaddrinfo ENOTFOUND with newest versions.

As per the comments, this issue does not appear in k3s 1.21, which is 1 version below yours. I know it's almost impossible, but is there any chance to try a similar setup on that version?

And it seems the error comes from node/lib/dns.js:

function errnoException(err, syscall, hostname) {
  // FIXME(bnoordhuis) Remove this backwards compatibility nonsense and pass
  // the true error to the user. ENOTFOUND is not even a proper POSIX error!
  if (err === uv.UV_EAI_MEMORY ||
      err === uv.UV_EAI_NODATA ||
      err === uv.UV_EAI_NONAME) {
    err = 'ENOTFOUND';
  }
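That mapping means ENOTFOUND can hide several different resolver failures (out of memory, no data, unknown name). A minimal probe sketch along these lines can help tell them apart: dns.lookup() goes through getaddrinfo, just like the failing call in your log, while dns.resolve4() queries the nameserver directly and reports finer-grained codes such as ETIMEOUT, ESERVFAIL or ENODATA. The hostname is taken from your log; the interval is an arbitrary assumption:

// dns-probe.js - a minimal sketch; run it inside one of the affected pods
const dns = require('dns').promises;

const HOSTNAME = 'employees-service'; // taken from the error log above
const INTERVAL_MS = 5000;             // assumed probe interval, adjust freely

async function probe() {
  const started = Date.now();
  try {
    const { address } = await dns.lookup(HOSTNAME);
    console.log(`lookup ok: ${address} (${Date.now() - started}ms)`);
  } catch (err) {
    // getaddrinfo path: this is the same ENOTFOUND as in the log above
    console.error(`lookup failed: ${err.code} (${Date.now() - started}ms)`);
    try {
      await dns.resolve4(HOSTNAME);
    } catch (raw) {
      // c-ares path: surfaces the underlying code that lookup() masks
      console.error(`resolve4 reports underlying code: ${raw.code}`);
    }
  }
}

setInterval(probe, INTERVAL_MS);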

What I wanted to suggest is to check Solving DNS lookup failures in Kubernetes. The article describes the long, hard way of catching the same error you have, one that also showed up from time to time.
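While you investigate, one possible stopgap on the application side (my assumption, not something from the article) is to retry calls that fail with ENOTFOUND, since the failures here look transient. withDnsRetry is a hypothetical helper, and getEmployee stands in for the call from your log:

// retry sketch - withDnsRetry is a hypothetical helper, not an existing API
async function withDnsRetry(fn, retries = 2, delayMs = 200) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const code = err.code; // error shape depends on your HTTP client
      if (code !== 'ENOTFOUND' || attempt >= retries) throw err;
      // back off a little before retrying the transient DNS failure
      await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
    }
  }
}

// e.g. const employee = await withDnsRetry(() => getEmployee(id));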

As a solution, after investigating all the metrics, logs, etc., it was installing the K8s cluster add-on called Node Local DNS cache, which:

improves Cluster DNS performance by running a dns caching agent on cluster nodes as a DaemonSet. In today's architecture, Pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the dns caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query kube-dns service for cache misses of cluster hostnames (cluster.local suffix by default).

Motivation

  • With the current DNS architecture, it is possible that Pods with the highest DNS QPS have to reach out to a different node, if there is no local kube-dns/CoreDNS instance. Having a local cache will help improve the latency in such scenarios.
  • Skipping iptables DNAT and connection tracking will help reduce conntrack races and avoid UDP DNS entries filling up the conntrack table.
  • Connections from the local caching agent to the kube-dns service can be upgraded to TCP. TCP conntrack entries will be removed on connection close, in contrast with UDP entries that have to time out (default nf_conntrack_udp_timeout is 30 seconds).
  • Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped UDP packets and DNS timeouts, usually up to 30s (3 retries + 10s timeout). Since the nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
  • Metrics & visibility into dns requests at a node level.
  • Negative caching can be re-enabled, thereby reducing the number of queries to the kube-dns service.
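Once the add-on is in place, one way to sanity-check it from a NodeJS pod is to point a resolver at the node-local address directly. 169.254.20.10 is the default link-local IP from the NodeLocal DNSCache docs, and the FQDN below assumes your service lives in the default namespace; adjust both if your deployment differs:

// verify-nodelocal.js - minimal sketch, assumes NodeLocal DNSCache is installed
const { Resolver } = require('dns').promises;

const resolver = new Resolver();
// 169.254.20.10 is the default link-local address the NodeLocal DNSCache
// DaemonSet listens on; change it if you deployed the add-on with another IP.
resolver.setServers(['169.254.20.10']);

// The 'default' namespace and cluster.local suffix are assumptions here.
resolver.resolve4('employees-service.default.svc.cluster.local')
  .then((addresses) => console.log('served by node-local cache:', addresses))
  .catch((err) => console.error('node-local cache query failed:', err.code));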
