
Akka Cluster heartbeat delays on Kubernetes

Our Scala application (a Kubernetes deployment) constantly experiences Akka Cluster heartbeat delays of ≈3s.

Once we even had a 200s delay, which also showed up in the following graph:

grafana-200s

Can someone suggest things to investigate further?

Specs

  • Kubernetes 1.12.5
  • Scala 2.12.7
  • Java 11.0.4+11
  • Akka Cluster 2.5.25

Java Flight Recording

Some examples:

timestamp    delay_ms
06:24:55.743 2693
06:30:01.424 3390
07:31:07.495 2487
07:36:12.775 3758

There were 4 suspicious time points where lots of Java Thread Park events were registered simultaneously for Akka threads (actors & remoting), and all of them correlate with heartbeat issues:

jfr-thread-park-1 jfr-thread-park-2

Around 07:05:39 there were no "heartbeat was delayed" logs, but there was this one:

07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis

No correlation with halt events or blocked threads was found during the Java Flight Recording session, only two Safepoint Begin events in proximity to the delays:

jfr-safepoint-begin

CFS throttling

The application's CPU usage is low, so we thought it could be related to how K8s schedules CPU for our application node. But turning off CPU limits hasn't improved things much, though the kubernetes.cpu.cfs.throttled.second metric disappeared.

Separate dispatcher

Using a separate dispatcher seems unnecessary, since the delays happen even when there is no load. We also built a dedicated test application, similar to our own, which does nothing but heartbeats, and it still experiences these delays.

K8s cluster

From our observations, it happens far more frequently on a couple of K8s nodes in a large K8s cluster shared with many other apps, at times when our application isn't under much load.

A separate, dedicated K8s cluster where our app is load tested has almost no issues with heartbeat delays.

Have you been able to rule out garbage collection? In my experience, that's the most common cause of delayed heartbeats in JVM distributed systems (and the CFS quota in a Kubernetes/Mesos environment can make non-Stop-The-World GCs effectively STW, especially if you're not using a really recent (later than release 212 of JDK8) version of openjdk).

Every thread parking before "Safepoint begin" does lead me to believe that GC is in fact the culprit. Certain GC operations (e.g. rearranging the heap) require every thread to be at a safepoint, so every so often, when not blocked, threads check whether the JVM wants them to safepoint; if so, the threads park themselves in order to get to a safepoint.
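One way to check this, sketched here on the assumption that the JVM is the Java 11 build listed in the question, is to turn on unified logging for GC and safepoints and compare the logged pause timestamps with the heartbeat warnings. The log file path below is a placeholder, not something taken from your setup:

```
-Xlog:gc*,safepoint:file=/tmp/gc-safepoint.log:time,uptime,level,tags
```

If a safepoint or GC entry with a stop time on the order of seconds lines up with a delayed heartbeat, that points at the JVM. If the log shows nothing unusual around the delay, the stall is more likely coming from outside the JVM (for example, from the node or the scheduler).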

If you've ruled out GC, are you running in a cloud environment (or on VMs where you can't be sure that the CPU or network aren't oversubscribed)? The akka-cluster documentation suggests increasing the akka.cluster.failure-detector.threshold value, whose default is suited to a more controlled LAN/bare-metal environment: 12.0 is recommended for cloud environments. This won't prevent delayed heartbeats, but it will decrease the chances of a spurious downing event caused by a single long heartbeat (at the cost of also delaying responses to genuine node loss events). If you want to tolerate a spike in heartbeat inter-arrival times from 1s to 200s, though, you'll need a really high threshold.
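As a minimal sketch, the override could go in application.conf roughly like this. The threshold of 12.0 is the value the documentation suggests for cloud environments. The acceptable-heartbeat-pause value is only an illustrative assumption on my part, picked to sit above the ≈3s delays in the question, and should be tuned against what you actually observe:

```
akka.cluster.failure-detector {
  # Default is 8.0; a higher value makes the detector slower to suspect a node.
  threshold = 12.0

  # Default is 3 s. The 5 s here is an assumed, illustrative value,
  # not a recommendation from the Akka documentation.
  acceptable-heartbeat-pause = 5 s
}
```

Both settings trade detection speed for tolerance: the more slack you give late heartbeats, the longer a node that has really failed stays in the cluster before it is marked unreachable.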
