
Flink: Cluster Execution error of loss of Taskmanager

I am running a real-time streaming program on Flink with 1 master and 2 workers. One worker runs on a separate machine, while the other runs on the master machine itself. I submit my program as a JAR in which the parallelism is set to 2. I am reading the data from Kafka with 2 brokers and 2 partitions.

In this setup, when I submit the job to the Flink cluster, it runs for a while and then fails with the error java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 82f8941ff339603995e37c453f8ff401. What is the probable reason for the loss of the TaskManager? (Only the TaskManager on the master machine is lost; the other one is still there and is shown in the Flink web interface.)

I ran into this problem too, and I found the following.

If you see a java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager even though the TaskManager did actually not crash, it means that the TaskManager was unresponsive for a time. That can be due to network issues, but is frequently due to long garbage collection stalls. In this case, a quick fix would be to use an incremental Garbage Collector, like the G1 garbage collector. It usually leads to shorter pauses. Furthermore, you can dedicate more memory to the user code by reducing the amount of memory Flink grabs for its internal operations (see configuration of TaskManager managed memory). If both of these approaches fail and the error persists, simply increase the TaskManager's heartbeat pause by setting AKKA_WATCH_HEARTBEAT_PAUSE (akka.watch.heartbeat.pause) to a greater value (e.g. 600s). This will cause the JobManager to wait for a heartbeat for a longer time interval before considering the TaskManager lost.

This solution comes from the Flink FAQ: https://flink.apache.org/faq.html
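As a rough sketch, the memory and heartbeat suggestions from the FAQ could look like this in conf/flink-conf.yaml. The key akka.watch.heartbeat.pause is the one named in the FAQ text; taskmanager.memory.fraction is my assumption for the "managed memory" setting and applies only to older Flink releases. The values are illustrative, not recommendations, and key names differ between Flink versions (newer releases use heartbeat.timeout instead), so check the configuration page for your release.

```yaml
# conf/flink-conf.yaml (sketch for older Flink releases; values are illustrative)

# Assumption: fraction of memory Flink reserves as managed memory.
# Lowering it leaves more heap for user code.
taskmanager.memory.fraction: 0.5

# Let the JobManager tolerate a longer gap between heartbeats
# before it declares the TaskManager lost.
akka.watch.heartbeat.pause: 600s
```

The JobManager and TaskManagers read this file at startup, so they need a restart before the new values take effect.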

I hope it can help you.

As ulysses said in his answer, you can increase the heartbeat pause or use an incremental garbage collector like G1GC (Flink's Docker images already use this garbage collector if it's available).

To enable G1GC, add the following argument to the java command that launches your Flink TaskManager:

-XX:+UseG1GC
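If the TaskManager is started through Flink's own scripts rather than a hand-written java command, one way to pass the flag is the env.java.opts entry in conf/flink-conf.yaml. This is a sketch that assumes a standard standalone setup; the key applies to every Flink JVM started by those scripts, not only the TaskManager.

```yaml
# conf/flink-conf.yaml (sketch; assumes Flink's startup scripts launch the JVMs)

# Extra JVM options passed to the Flink processes; here it selects G1.
env.java.opts: -XX:+UseG1GC
```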

You can find more info about this incremental Garbage collector in the following links:

We observed this error when the node hosting the TaskManager was running low on free disk space.

We are currently using Flink 3.7.1, but earlier versions were also affected.

This was even reported as a bug (https://issues.apache.org/jira/browse/FLINK-5844), but the issue was closed because the reporter stopped responding.
