
Kafka: Bounded Batch Processing in Parallel

I would like to use Kafka for bounded batch processing, where the program knows when it is processing the last record.

Batch:

  • Reads a flat file
  • Sends each line as a message to Kafka

Kafka Listener:

  • Consumes messages from Kafka
  • Inserts each record into the database
  • If it is the last record, marks the batch job as done in the database

One way is probably to use a single Kafka partition, assuming FIFO (First In, First Out) ordering is guaranteed, and have the batch program send an isLastRecord flag with the final message.

However, this restricts processing to a single thread (a single consumer).
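The single-partition idea can be sketched as a producer-side helper that tags only the final line. The tagging logic below is pure Python and self-contained; the commented send loop is a hypothetical use of kafka-python's `KafkaProducer` (an assumption, not part of the question):

```python
import json

def tag_last_record(lines):
    """Yield one payload dict per line; only the final one has isLastRecord=True."""
    it = iter(lines)
    try:
        prev = next(it)
    except StopIteration:
        return  # empty file: nothing to send
    for line in it:
        yield {"line": prev, "isLastRecord": False}
        prev = line
    yield {"line": prev, "isLastRecord": True}

# Hypothetical send loop, assuming kafka-python and a single-partition topic
# (single partition keeps FIFO order, so the flagged message arrives last):
#
#   producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                            value_serializer=lambda v: json.dumps(v).encode())
#   with open("batch.txt") as f:
#       for payload in tag_last_record(f):
#           producer.send("batch-topic", payload)
#   producer.flush()
```

Note the one-element lookahead: the producer cannot know a line is last until the read after it fails, which is why the helper buffers `prev`.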

Question

Is there any way to achieve this with parallel-processing by leveraging multiple Kafka partitions?

If you need in-order guarantees per file, you are restricted to a single partition.

If you have multiple files, though, you could use different partitions for different files.
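One way to realize "different partitions for different files" is deterministic routing by filename, so every line of a given file lands on the same partition (preserving per-file order) while distinct files can be consumed in parallel. This is a minimal sketch; kafka-python normally does its own key hashing (murmur2), so the explicit `crc32` routing here is an illustrative assumption:

```python
import zlib

def partition_for_file(filename: str, num_partitions: int) -> int:
    # Same filename -> same partition, so lines of one file stay in order;
    # different files spread across partitions and can run in parallel.
    return zlib.crc32(filename.encode("utf-8")) % num_partitions

# Hypothetical usage with kafka-python's explicit partition argument:
#   producer.send("batch-topic", value=payload,
#                 partition=partition_for_file("fileA.txt", 6))
```

Passing the filename as the message *key* and letting the client's default partitioner hash it achieves the same effect without custom code.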

If each line in the file is an insert into a database, I wonder, though, whether you need an in-order guarantee in the first place, or whether you can insert all records/lines in any order.
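If order does not matter, each consumer can insert its share of records independently. Making the insert idempotent (keyed by line number) also makes redelivered Kafka messages harmless. A minimal sketch with sqlite3 standing in for the real database (the table name and schema are assumptions):

```python
import sqlite3

def insert_records(conn, records):
    """Order-independent, idempotent inserts.

    records: iterable of (line_no, line) pairs. The primary key on line_no
    means a redelivered message is ignored instead of duplicated, so any
    consumer can process any subset of lines in any order.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batch_lines (line_no INTEGER PRIMARY KEY, line TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO batch_lines VALUES (?, ?)", records)
    conn.commit()
```

With inserts like this, each partition's consumer runs independently, and "is the batch done?" reduces to comparing the row count against the expected total.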

A more fundamental question is: why do you need to put the data into Kafka first? Why not read the file and do the insert directly?
