Flink 1.16: restart strategy works fine, but messages are lost when the entire Job Manager restarts

The restart strategy works fine, but I lose messages when the entire Job Manager restarts after the maximum number of retry attempts. For example, I send two messages back to back. The first message causes an exception, so the job retries up to the maximum number of attempts set in the config. After that, the entire Job Manager restarts, and at that point I lose the second message.

    streamExecutionEnvironment.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(
                    4,                            // number of restart attempts
                    Time.of(4, TimeUnit.SECONDS)  // delay between attempts
            ));

Once the Job Manager comes back up, I expected it to consume the second message, but it does not. It seems the second message is lost. Could anyone help me with this situation?

Without more information, it's not clear what's happening, or why. But I'll throw out some guesses, and perhaps one will be correct.

You could be stuck in a fail -> restart -> fail again loop. If you don't skip over poison pills, Flink will:

  1. throw an exception caused by a poison pill (a record that can't be processed for some reason)
  2. restart
  3. try again to consume the poison pill, and fail again
  4. restart again
  5. ...
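One common way out of this loop is to catch the failure per record and set the bad record aside, rather than letting the exception fail the job. The following is a minimal sketch in plain Java, assuming the records are strings that should parse to integers. `SafeParser`, `parseOrSkip`, and `rejected` are hypothetical names used for illustration and are not Flink APIs. In a real job this logic would typically live inside a flatMap or process function, with rejected records sent to a side output or a dead-letter topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Flink: parses a record and reports
// failure as an empty Optional instead of throwing.
final class SafeParser {
    // Records that could not be parsed, kept for later inspection.
    private final List<String> rejected = new ArrayList<>();

    Optional<Integer> parseOrSkip(String raw) {
        try {
            return Optional.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException | NullPointerException e) {
            // Poison pill: remember it and move on, so the job does not fail.
            rejected.add(raw);
            return Optional.empty();
        }
    }

    List<String> rejected() {
        return rejected;
    }
}
```

With this shape, a record such as `"oops"` ends up in `rejected()` and the next record is still processed, so a restart is never triggered by that input.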

Or you could be using a source that doesn't participate in checkpointing.

Or perhaps your source isn't rewindable. Flink's approach to fault tolerance requires that the sources can be rewound, so that any input records consumed since the last checkpoint are re-ingested after a restart. But some sources can't support this (e.g., sockets or HTTP endpoints).
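To make the rewind requirement concrete, here is a toy model in plain Java (not Flink code) of a source that remembers its position at the last checkpoint and resumes from there after a failure. `ReplayableSource` and all of its methods are hypothetical names invented for this illustration.

```java
import java.util.List;

// Toy model (not a Flink API) of a rewindable source with a checkpointed offset.
final class ReplayableSource {
    private final List<String> log;      // durable input, e.g. a partition of a log
    private int position = 0;            // index of the next record to emit
    private int checkpointedOffset = 0;  // position saved by the last checkpoint

    ReplayableSource(List<String> log) {
        this.log = log;
    }

    boolean hasNext() {
        return position < log.size();
    }

    String next() {
        return log.get(position++);
    }

    // Record how far the source has read at checkpoint time.
    void checkpoint() {
        checkpointedOffset = position;
    }

    // After a failure, rewind to the last checkpoint; everything read
    // since then will be emitted again.
    void restore() {
        position = checkpointedOffset;
    }
}
```

If the source reads the first message, takes a checkpoint, reads the second message, and then fails, `restore()` makes the second message available again. A socket has no durable log to rewind, which is why a record read from it just before a failure is gone.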

Or you could be relying on JobManagerCheckpointStorage, which keeps checkpoints in the Job Manager's memory, in which case the checkpoints are lost when the JM restarts.

A job failure shouldn't cause a Job Manager restart. And it sounds like your cluster is probably not set up to handle recovery after a Job Manager failure. You could read the docs on HA for your specific resource provider -- the entry point is here.

