
Interesting Flink Problem - How to restore state in Flink if the Task Manager fails, in order to guarantee exactly-once processing?

Hi, I am new to Flink and trying to figure out some best practices for the following scenario:

I am playing around with a Flink job that reads data from multiple CSV files. Each row in the CSV is composed of three columns: userId, appId, and name. I have to do some processing on each of these records (capitalize the name) and post the record to a Kafka topic.

The goal is to filter out any duplicate records so we do not end up with duplicate messages in the output Kafka topic.

I am doing a keyBy(userId, appId) on the stream and keeping a Boolean ValueState named "Processed" to filter out duplicate records.
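For context, here is a simplified sketch of the deduplication logic (CsvRecord is a placeholder POJO for a CSV row; the state name matches what I described above):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Placeholder POJO for a CSV row (field names assumed from the question).
class CsvRecord {
    public String userId;
    public String appId;
    public String name;
}

public class Deduplicator extends RichFlatMapFunction<CsvRecord, CsvRecord> {

    // Managed keyed state: one Boolean per (userId, appId) key.
    private transient ValueState<Boolean> processed;

    @Override
    public void open(Configuration parameters) {
        processed = getRuntimeContext().getState(
            new ValueStateDescriptor<>("Processed", Types.BOOLEAN));
    }

    @Override
    public void flatMap(CsvRecord record, Collector<CsvRecord> out) throws Exception {
        // Emit only the first record seen for the current key.
        if (processed.value() == null) {
            processed.update(true);
            record.name = record.name.toUpperCase();
            out.collect(record);
        }
    }
}
```

This is wired up as `stream.keyBy(r -> r.userId + "|" + r.appId).flatMap(new Deduplicator())`.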

The issue is that when I cancel the Task Manager in the middle of processing a file (to simulate a failure), it starts processing the file from the beginning once it restarts.

This is a problem because the "Processed" state in the Flink job is also wiped clean after the Task Manager fails!

This leads to duplicate messages in the output Kafka topic.

How can I prevent this?

I need to restore the "Processed" Flink state to what it was prior to the Task Manager failing. What is the best practice to do this?

Would Flink's CheckpointedFunction (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/#checkpointedfunction) help? I suspect not, since this is a keyed stream.

Things to consider:

  1. Flink Checkpointing is already enabled.
  2. The Task Managers run as Kubernetes pods (which can be scaled very quickly), and parallelism is always > 1.
  3. Files can have millions of rows and need to be processed in parallel.

Thank you for the help!

I would recommend reading up on Flink's fault-tolerance mechanisms: checkpointing and savepointing. https://nightlies.apache.org/flink/flink-docs-master/docs/learn-flink/fault_tolerance/ and https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/ are good places to start.
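As a minimal sketch (the checkpoint interval and storage URI are placeholders), enabling exactly-once checkpointing with durable checkpoint storage looks roughly like this:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Take a checkpoint every 10 seconds with exactly-once semantics.
env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

// Store checkpoints somewhere durable (e.g. S3/HDFS), so a replacement
// TaskManager pod can restore the "Processed" state after a failure
// instead of starting with empty state.
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints"); // placeholder URI
```

The key point for your scenario is that checkpoints must live on storage that outlives the Task Manager pod; local-only checkpoint storage is lost when the pod dies.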

I think you could also achieve your deduplication more easily by using the Table API/SQL. See https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/deduplication/
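As a rough sketch of that approach, assuming a source table `records(userId, appId, name)` with a processing-time attribute `proctime` has already been registered (all names here are placeholders):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// Keep only the first row seen per (userId, appId) key and
// capitalize the name, following the documented dedup pattern.
tEnv.executeSql(
    "SELECT userId, appId, UPPER(name) AS name " +
    "FROM ( " +
    "  SELECT *, " +
    "    ROW_NUMBER() OVER (PARTITION BY userId, appId ORDER BY proctime ASC) AS row_num " +
    "  FROM records " +
    ") " +
    "WHERE row_num = 1");
```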

You need to use Flink's managed keyed state, which Flink will maintain in a manner that is consistent with the sources and sinks; Flink will guarantee exactly-once behavior provided you set up checkpointing and configure the sinks properly.

There's a tutorial on this topic in the Flink documentation, and it happens to use deduplication as the example use case.

For more info on how to configure Kafka for exactly-once operation, see Exactly-once with Apache Kafka.
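As a sketch of the sink side (broker address, topic, and transactional-id prefix are placeholders), the KafkaSink connector can be configured for exactly-once delivery like this:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("broker:9092")              // placeholder
    .setRecordSerializer(
        KafkaRecordSerializationSchema.builder()
            .setTopic("deduped-output")              // placeholder
            .setValueSerializationSchema(new SimpleStringSchema())
            .build())
    // Writes go into Kafka transactions that are committed when
    // the enclosing checkpoint completes.
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .setTransactionalIdPrefix("dedup-job")           // required for EXACTLY_ONCE
    .build();
```

Note that with EXACTLY_ONCE the producer's transaction.timeout.ms must be larger than the checkpoint interval, and downstream consumers should read with isolation.level=read_committed to avoid seeing uncommitted duplicates.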

Disclaimer: I wrote that part of the Flink documentation, and I work for Immerok.
