Kubernetes (K3S) POD gets "ENOTFOUND" after 5-20 hours of airing time

I'm running my backend on Kubernetes, with around 250 pods under 15 deployments; the backend is written in NodeJS.

Sometimes, after X hours (5<X<30), I get ENOTFOUND in one of the pods, as follows:

{
  "name": "main",
  "hostname": "entrypoint-sdk-54c8788caa-aa3cj",
  "pid": 19,
  "level": 50,
  "error": {
    "errno": -3008,
    "code": "ENOTFOUND",
    "syscall": "getaddrinfo",
    "hostname": "employees-service"
  },
  "msg": "Failed calling getEmployee",
  "time": "2022-01-28T13:44:36.549Z",
  "v": 0
}

I'm running a stress test on the backend at YY users per second, but I keep this stress level steady and don't change it, and then the error happens out of nowhere with no specific reason.

Kubernetes is K3S, Server Version: v1.21.5+k3s2

Any idea what might cause this weird ENOTFOUND?

I already saw the same question on GitHub, with a reference to getaddrinfo ENOTFOUND with newest versions.

As per the comments, this issue does not appear in k3s 1.21, which is 1 version below yours. I know it's almost impossible, but is there any chance to try a similar setup on that version?

And it seems the error comes from node/lib/dns.js:

function errnoException(err, syscall, hostname) {
  // FIXME(bnoordhuis) Remove this backwards compatibility nonsense and pass
  // the true error to the user. ENOTFOUND is not even a proper POSIX error!
  if (err === uv.UV_EAI_MEMORY ||
      err === uv.UV_EAI_NODATA ||
      err === uv.UV_EAI_NONAME) {
    err = 'ENOTFOUND';
  }
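That mapping means ENOTFOUND can hide several different resolver failures (out of memory, no data, unknown name). A minimal probe sketch along these lines can help tell them apart: dns.lookup() goes through getaddrinfo, just like the failing call in your log, while dns.resolve4() queries the nameserver directly and reports finer-grained codes such as ETIMEOUT, ESERVFAIL or ENODATA. The hostname is taken from your log; the interval is an arbitrary assumption:

// dns-probe.js - a minimal sketch; run it inside one of the affected pods
const dns = require('dns').promises;

const HOSTNAME = 'employees-service'; // taken from the error log above
const INTERVAL_MS = 5000;             // assumed probe interval, adjust freely

async function probe() {
  const started = Date.now();
  try {
    const { address } = await dns.lookup(HOSTNAME);
    console.log(`lookup ok: ${address} (${Date.now() - started}ms)`);
  } catch (err) {
    // getaddrinfo path: this is the same ENOTFOUND as in the log above
    console.error(`lookup failed: ${err.code} (${Date.now() - started}ms)`);
    try {
      await dns.resolve4(HOSTNAME);
    } catch (raw) {
      // c-ares path: surfaces the underlying code that lookup() masks
      console.error(`resolve4 reports underlying code: ${raw.code}`);
    }
  }
}

setInterval(probe, INTERVAL_MS);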

What I wanted to suggest is to check Solving DNS lookup failures in Kubernetes. The article describes the long, hard way of catching the same error you have, one that also showed up from time to time.
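While you investigate, one possible stopgap on the application side (my assumption, not something from the article) is to retry calls that fail with ENOTFOUND, since the failures here look transient. withDnsRetry is a hypothetical helper, and getEmployee stands in for the call from your log:

// retry sketch - withDnsRetry is a hypothetical helper, not an existing API
async function withDnsRetry(fn, retries = 2, delayMs = 200) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const code = err.code; // error shape depends on your HTTP client
      if (code !== 'ENOTFOUND' || attempt >= retries) throw err;
      // back off a little before retrying the transient DNS failure
      await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
    }
  }
}

// e.g. const employee = await withDnsRetry(() => getEmployee(id));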

As a solution, after investigating all the metrics, logs, etc., it was installing the K8s cluster add-on called Node Local DNS cache, which:

improves Cluster DNS performance by running a dns caching agent on cluster nodes as a DaemonSet. In today's architecture, Pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the dns caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query kube-dns service for cache misses of cluster hostnames (cluster.local suffix by default).

Motivation

  • With the current DNS architecture, it is possible that Pods with the highest DNS QPS have to reach out to a different node, if there is no local kube-dns/CoreDNS instance. Having a local cache will help improve the latency in such scenarios.
  • Skipping iptables DNAT and connection tracking will help reduce conntrack races and avoid UDP DNS entries filling up the conntrack table.
  • Connections from the local caching agent to the kube-dns service can be upgraded to TCP. TCP conntrack entries will be removed on connection close, in contrast with UDP entries that have to time out (default nf_conntrack_udp_timeout is 30 seconds).
  • Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped UDP packets and DNS timeouts, usually up to 30s (3 retries + 10s timeout). Since the nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
  • Metrics & visibility into dns requests at a node level.
  • Negative caching can be re-enabled, thereby reducing the number of queries to the kube-dns service.
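Once the add-on is in place, one way to sanity-check it from a NodeJS pod is to point a resolver at the node-local address directly. 169.254.20.10 is the default link-local IP from the NodeLocal DNSCache docs, and the FQDN below assumes your service lives in the default namespace; adjust both if your deployment differs:

// verify-nodelocal.js - minimal sketch, assumes NodeLocal DNSCache is installed
const { Resolver } = require('dns').promises;

const resolver = new Resolver();
// 169.254.20.10 is the default link-local address the NodeLocal DNSCache
// DaemonSet listens on; change it if you deployed the add-on with another IP.
resolver.setServers(['169.254.20.10']);

// The 'default' namespace and cluster.local suffix are assumptions here.
resolver.resolve4('employees-service.default.svc.cluster.local')
  .then((addresses) => console.log('served by node-local cache:', addresses))
  .catch((err) => console.error('node-local cache query failed:', err.code));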
