
Flink Consumer with DataStream API for Batch Processing - How do we know when to stop, and how do we stop processing? [two-fold]

I am trying to run the same Flink pipeline of transformations in both batch and real-time modes, with different input parameters to distinguish between the two. I want to use the DataStream API, as most of my transformations depend on it.

My producer is Kafka, and the real-time pipeline works just fine. Now I want to build a batch pipeline with the exact same code, using different topics for batch and real-time mode. How does my batch processor know when to stop processing?

One way I thought of was to add an extra field to the producer record marking it as the last record. However, with a multi-partitioned topic, delivery order is not guaranteed across partitions (it is guaranteed within a single partition), so a single "last record" marker is not reliable.

What is the best practice to design this?

PS: I don't want to use the DataSet API.

You can use the DataStream API for batch processing without any issue. When the input is bounded, Flink injects a marker that signals the end of the stream, so your application works on finite streams instead of infinite ones.
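As a side note, newer Flink versions (1.12+) make this concrete without any end-marker tricks: the KafkaSource connector can be declared bounded, and the job can run with batch execution semantics. This is a minimal sketch of that approach, not part of the original answer; the broker address, topic, and group id are placeholders:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Run the same DataStream topology with batch scheduling and shuffles.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // Bounded source: read from the earliest offsets up to the offsets
        // present when the job starts, then finish the source.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // placeholder
                .setTopics("batch-topic")                // placeholder
                .setGroupId("batch-job")                 // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> records =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-batch-source");

        // The same transformations as the real-time pipeline would go here.
        records.print();

        env.execute("batch-mode-run");
    }
}
```

With `setBounded`, the job stops once every partition reaches the snapshot offsets, so no special "last record" is needed at all.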

To be completely honest, I am not sure Kafka is the best fit for this problem.

Generally, when implementing KafkaDeserializationSchema you have the method isEndOfStream(), which lets you signal that the stream has finished. You could inject an end marker into each partition and finish the stream once a marker has been read from every partition, but this requires you to know the number of partitions beforehand. A sketch of this idea follows.
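Here is a rough sketch of that per-partition marker idea, assuming parallelism 1 (or that each source subtask knows how many partitions it reads); the marker value `__END__` and the class name are hypothetical:

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch: ends the stream once an end marker has been seen on every
 * partition this subtask reads. Assumes the producer writes exactly one
 * "__END__" record per partition after the real data.
 */
public class EndMarkerSchema implements KafkaDeserializationSchema<String> {

    private static final String END_MARKER = "__END__"; // hypothetical marker value

    private final int expectedPartitions; // known up front, per the caveat above
    private final Set<Integer> finishedPartitions = new HashSet<>();

    public EndMarkerSchema(int expectedPartitions) {
        this.expectedPartitions = expectedPartitions;
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) {
        String value = new String(record.value(), StandardCharsets.UTF_8);
        if (END_MARKER.equals(value)) {
            // Remember which partition has delivered its marker.
            finishedPartitions.add(record.partition());
        }
        return value;
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // Stop only when the marker just read completes the full set.
        return END_MARKER.equals(nextElement)
                && finishedPartitions.size() >= expectedPartitions;
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }
}
```

Note that the producer would have to target each partition explicitly when writing the markers (e.g. `new ProducerRecord<>(topic, partitionId, null, "__END__")`), and markers read before the final one still flow into the pipeline, so downstream operators should filter them out.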
