
How to fix stuck checkpoints in Apache Flink

I have a setup in Flink 1.7.2 running on a Cloudera-managed cluster (resource allocation via YARN) that ingests high-volume data from an external Kafka cluster and pipes it through a series of operators that aggregate, compute, aggregate again... I even use an iterative loop with filters and multiple operators inside, and finally a sink that writes the results to a RocksDB backend on my Hadoop cluster.

All of that works for a certain amount of time (currently around 2-3 hours), and then the checkpointing gets stuck. I use exactly-once checkpointing with a generous timeout of 30 minutes, a 10-minute pause between checkpoints, and 1 concurrent checkpoint. As long as everything works, these checkpoints finish within a minute. But after a couple of hours one checkpoint gets stuck, meaning that the Checkpoint UI tab tells me that one (or multiple) operators have not acknowledged all subtasks. By that time the normal processing has gotten stuck as well: the watermarks on my input source stop advancing, no more output is produced, and nothing moves until the checkpoint timeout runs out. The next checkpoint then activates immediately, writes maybe 10% of all tasks, and gets stuck again. There is no chance of recovery. If I cancel the job and restart it with the last successful checkpoint as the starting point, the next checkpoint gets stuck in the same way.
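For reference, the checkpoint setup described above would look roughly like this in Flink 1.7's Java API. This is a sketch, not the actual job: the checkpoint interval and the HDFS path are hypothetical placeholders, since the question does not state them.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once checkpoints; the interval here is a placeholder.
        env.enableCheckpointing(5 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        config.setCheckpointTimeout(30 * 60 * 1000L);          // generous 30-minute timeout
        config.setMinPauseBetweenCheckpoints(10 * 60 * 1000L); // 10-minute pause between checkpoints
        config.setMaxConcurrentCheckpoints(1);                 // 1 concurrent checkpoint

        // RocksDB state backend writing to the Hadoop cluster (hypothetical path).
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
    }
}
```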

I have already tried many different things, from changing the checkpoint frequency to the timeouts. I even switched from exactly-once to at-least-once, since the alignment buffering was sometimes getting very expensive, but the same problem emerged after the same amount of time. Resource allocation does not seem to play a role either: I currently use 4 task slots per task manager and change the number of task managers from time to time, but nothing changes. JVM heap size does not appear to be a problem either, as I allocate multiple GB but apparently only a couple hundred MB are used.

No error messages are put out by the job managers or task managers; all the logs show is the attempt to write the checkpoint, the missing success message, and then the start of the next checkpoint.

When you say that you use "an iterative loop with filters and multiple operators inside", are you using Flink's iteration construct with a streaming job?

Doing so is not recommended. As it says in the documentation:

Flink currently only provides processing guarantees for jobs without iterations. Enabling checkpointing on an iterative job causes an exception. In order to force checkpointing on an iterative program the user needs to set a special flag when enabling checkpointing: env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE, force = true).

Please note that records in flight in the loop edges (and the state changes associated with them) will be lost during failure.
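In the Java API, the special flag from the quoted documentation corresponds to a (deprecated) three-argument overload of enableCheckpointing. A minimal sketch, assuming you accept the caveat that in-flight records on the loop edges can be lost (the interval value is illustrative):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ForcedCheckpointing {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The third argument ("force") enables checkpointing even though the
        // job contains iterations; without it, Flink throws an exception.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE, true);
    }
}
```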

That said, what you've described sounds like a situation in which backpressure is preventing the checkpoint barriers from progressing. Many things might cause this, but this blog post might help you diagnose the problem. I'm not sure how much of it applies to a job using iterations, though.

Also consider data skew in your pipeline; perhaps you can increase the parallelism of the affected operators to make the load more balanced.
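To act on that suggestion, parallelism can be raised per operator rather than job-wide, and a rebalance can spread skewed input evenly across subtasks. A hypothetical sketch (the source, operator, and parallelism values are illustrative, not taken from the job in question):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SkewMitigation {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // default parallelism for the whole job

        DataStream<String> source = env.socketTextStream("localhost", 9999); // placeholder source

        source
            .rebalance()           // round-robin redistribution to even out skewed partitions
            .map(String::toUpperCase)
            .setParallelism(8);    // give the hot operator more subtasks than the default
    }
}
```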

