
Flink: Cluster Execution error of loss of Taskmanager

Original post 2017-01-25 07:02:00 1 3 java/ apache-flink/ flink-streaming

I am running a real-time streaming program on Flink with 1 master and 2 workers. One worker runs on a separate machine, while the other runs on the master machine itself. I submit my program as a JAR in which the parallelism is set to 2. I also read the data from Kafka, which has 2 brokers and 2 partitions.

With this setup, when I submit the job to the Flink cluster, it runs for a while and then fails with the error java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 82f8941ff339603995e37c453f8ff401. What is the probable reason for the loss of the TaskManager? (Only the TaskManager on the master machine is lost; the other one is still there and is shown in the Flink web interface.)

3 answers

I ran into this problem too, and I found this:

If you see a java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager even though the TaskManager did not actually crash, it means that the TaskManager was unresponsive for a time. That can be due to network issues, but is frequently due to long garbage collection stalls. In this case, a quick fix would be to use an incremental garbage collector, like the G1 garbage collector. It usually leads to shorter pauses. Furthermore, you can dedicate more memory to the user code by reducing the amount of memory Flink grabs for its internal operations (see configuration of TaskManager managed memory).

If both of these approaches fail and the error persists, simply increase the TaskManager's heartbeat pause by setting AKKA_WATCH_HEARTBEAT_PAUSE (akka.watch.heartbeat.pause) to a greater value (e.g. 600s). This will cause the JobManager to wait for a heartbeat for a longer time interval before considering the TaskManager lost.

This solution comes from https://flink.apache.org/faq.html

I hope this helps.
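To judge whether garbage collection is a plausible cause before changing any settings, the JVM's own GC counters can be inspected. The following is a minimal sketch, not something from the Flink FAQ: the class name `GcPauseReport` and the method `summarize` are hypothetical, and the assumption is that the code is called from inside the TaskManager JVM (for example from user code), because it only reports on the JVM it runs in. It uses the standard `java.lang.management` API and prints accumulated collection time per collector, not individual pause lengths.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports GC activity of the JVM it runs in.
public class GcPauseReport {

    /** Returns one line per collector: name, collection count, accumulated time. */
    static List<String> summarize() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() is the approximate accumulated collection
            // time in milliseconds, or -1 if the collector does not report it.
            lines.add(String.format("%s: count=%d, totalMillis=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize()) {
            System.out.println(line);
        }
    }
}
```

If the accumulated time grows by many seconds between two calls made shortly before the TaskManager is reported lost, long GC stalls become a likely suspect; if it barely moves, network issues are worth checking first.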

As ulysses said in his answer, you can increase the heartbeat pause or use an incremental garbage collector like G1GC (Flink's Docker images already use this garbage collector if it's available).

To enable G1GC, add the following argument to the java command that launches your Flink TaskManager:

-XX:+UseG1GC
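To confirm that the flag actually reached the JVM, the running process can be asked for its startup arguments and active collectors. This is a rough sketch under stated assumptions: the class `G1Check` and its methods are hypothetical names, the code must run inside the TaskManager JVM to say anything about it, and the check that collector names start with "G1" relies on HotSpot's naming (for example "G1 Young Generation"), which other JVMs may not share.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks whether G1 was requested and is in use.
public class G1Check {

    /** True if the JVM arguments contain the G1 flag exactly as written. */
    static boolean hasG1Flag(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseG1GC");
    }

    /** True if any collector name starts with "G1" (HotSpot naming, an assumption). */
    static boolean g1Active(List<String> collectorNames) {
        for (String name : collectorNames) {
            if (name.startsWith("G1")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the main class.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }

        System.out.println("G1 flag passed: " + hasG1Flag(jvmArgs));
        System.out.println("G1 active: " + g1Active(names));
    }
}
```

The two checks can disagree: recent JDKs use G1 by default without the flag, so "G1 active" may be true even when "G1 flag passed" is false.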

You can find more info about this incremental Garbage collector in the following links:

We observed this error when the node hosting this TaskManager was running low on free space.

We are currently using Flink 3.7.1, but earlier versions were also affected.

This was even reported as a bug, https://issues.apache.org/jira/browse/FLINK-5844, but it was closed because the reporter wasn't responding.


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
1 2017-02-15 05:50:33

solution2
1 2017-12-04 08:02:17

solution3
0 2020-02-19 14:53:19
