Flink Yarn infinite restart on task failure

Question

I am running a Flink streaming job on an AWS YARN cluster with the following configuration:

Master Node - 1, Core Node - 1, Task Nodes - 3

I enabled the region failover strategy:

jobmanager.execution.failover-strategy: region

One of my task nodes is failing, and Flink tries to restart at the region level (in my case, at the task node level). I set the restart strategy to fixed delay (fixedDelayRestart) with 5 attempts and a delay of 5 minutes, and my checkpoints are disabled.

Reference Image

As the image shows, the job is restarting more times than expected.

Can anybody help me understand why it is behaving like this?

Answer 1

The documentation has a section about the "Restart Pipelined Region Failover Strategy" [1]. The bottom line is that if you have a streaming job with an operator that physically partitions the stream, such as keyBy, all tasks end up in the same region, so every task failure restarts all tasks as a whole. For batch jobs, you need to configure the ExecutionMode [2] to be BATCH or BATCH_FORCED.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-pipelined-region-failover-strategy

[2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/api/java/org/apache/flink/api/common/ExecutionMode.html
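To illustrate why keyBy collapses everything into one region, here is a minimal sketch in plain Java. It is not Flink's implementation: it assumes a simplified model in which a pipelined region is just a connected component of tasks linked by pipelined edges, and the class and method names (PipelinedRegions, forwardOnly, withKeyBy) are hypothetical. It compares a two-operator job with forward (one-to-one) connections against the same job with an all-to-all connection such as keyBy.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Simplified model (an assumption, not Flink code): a pipelined region is a
 * connected component of tasks that are linked by pipelined edges.
 */
class PipelinedRegions {
    private final int[] parent; // union-find parent pointer per task

    PipelinedRegions(int taskCount) {
        parent = new int[taskCount];
        for (int i = 0; i < taskCount; i++) {
            parent[i] = i; // every task starts in its own region
        }
    }

    int find(int task) {
        while (parent[task] != task) {
            parent[task] = parent[parent[task]]; // path halving
            task = parent[task];
        }
        return task;
    }

    /** Records a pipelined edge between two tasks, merging their regions. */
    void connect(int a, int b) {
        parent[find(a)] = find(b);
    }

    int regionCount() {
        Set<Integer> roots = new HashSet<>();
        for (int i = 0; i < parent.length; i++) {
            roots.add(find(i));
        }
        return roots.size();
    }

    /** Two operators joined one-to-one: upstream subtask i feeds downstream subtask i. */
    static int forwardOnly(int parallelism) {
        PipelinedRegions graph = new PipelinedRegions(2 * parallelism);
        for (int i = 0; i < parallelism; i++) {
            graph.connect(i, parallelism + i);
        }
        return graph.regionCount();
    }

    /** Two operators joined all-to-all, as after keyBy: every upstream subtask feeds every downstream subtask. */
    static int withKeyBy(int parallelism) {
        PipelinedRegions graph = new PipelinedRegions(2 * parallelism);
        for (int up = 0; up < parallelism; up++) {
            for (int down = 0; down < parallelism; down++) {
                graph.connect(up, parallelism + down);
            }
        }
        return graph.regionCount();
    }

    public static void main(String[] args) {
        System.out.println("forward only: " + forwardOnly(3) + " regions"); // 3 regions
        System.out.println("with keyBy:   " + withKeyBy(3) + " region");    // 1 region
    }
}
```

Under this model, a failure in the forward-only job would restart only one of the three regions, while in the keyBy job the single region contains every task, so region failover behaves like a full job restart.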

solution1
1 2019-11-25 13:44:58
