
Kafka: Bounded Batch Processing in Parallel

I would like to use Kafka to perform bounded batch processing, where the program will know when it is processing the last record.

Batch:

  • Read a flat file
  • Send each line as a message to Kafka

Kafka Listener:

  • Consume messages from Kafka
  • Insert each record into the database
  • If it is the last record, mark the batch job as done in the database.

One way is probably to use a single Kafka partition, assuming FIFO (First In, First Out) ordering is guaranteed, and have the batch program send an isLastRecord flag.

However, this restricts the processing to a single thread (a single consumer).
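The flag-based approach could be sketched as a small message envelope shared between producer and consumer. This is a hypothetical JSON layout (the field names "payload" and "isLastRecord" are assumptions, not an existing convention); the producer would send these messages in order to a single-partition topic.

```python
import json

def wrap_lines(lines):
    """Wrap each file line in a JSON envelope, flagging the final one.

    Hypothetical envelope: the field names ("payload", "isLastRecord")
    are an assumption that producer and consumer must agree on.
    """
    total = len(lines)
    return [
        json.dumps({"payload": line, "isLastRecord": i == total - 1})
        for i, line in enumerate(lines)
    ]

messages = wrap_lines(["row1", "row2", "row3"])
print(json.loads(messages[-1])["isLastRecord"])  # → True
```

The consumer then inserts each payload and, when it sees isLastRecord set, marks the batch done.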

Question

Is there any way to achieve this with parallel processing by leveraging multiple Kafka partitions?

If you need in-order guarantees per file, you are restricted to a single partition.

If you have multiple files, though, you could use different partitions for different files.
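One way to get this effect is to key every message by its source file name: Kafka's default partitioner sends equal keys to the same partition, so each file's lines stay in order while different files are consumed in parallel. A minimal sketch of that mapping, using CRC32 as a stand-in for the murmur2 hash Kafka's Java client actually uses (an illustrative assumption, not the real partitioner):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministic key -> partition mapping: the same key always lands
    on the same partition, so all lines of one file keep their order.
    CRC32 here stands in for Kafka's murmur2-based default partitioner."""
    return zlib.crc32(key) % num_partitions

# Every record keyed "orders.csv" goes to one partition; "users.csv"
# may land on a different one and be consumed by another thread.
p1 = partition_for(b"orders.csv", 6)
p2 = partition_for(b"orders.csv", 6)
print(p1 == p2)  # → True
```

With one consumer per partition in the same consumer group, each file is processed single-threaded (preserving order) while the files themselves are processed concurrently.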

If each line in the file is an insert into a database, I wonder whether you need an in-order guarantee in the first place, or whether you can insert all records/lines in any order?
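If order does not matter, completion can be detected by count rather than by a flagged last message: the producer records the expected total (the file's line count), and each consumer increments a processed counter after its insert, marking the batch done when the counter reaches the total. A toy sketch of the bookkeeping (a hypothetical class; in a real multi-consumer setup the counter would live in the database as an atomic UPDATE, not in-process state):

```python
class BatchTracker:
    """Tracks how many records of a bounded batch have been inserted.

    record_inserted() returns True once all expected records are in,
    regardless of which consumer or partition handled each one.
    Illustrative only: real consumers would increment a row in the
    database atomically instead of sharing this object.
    """

    def __init__(self, expected_total: int):
        self.expected_total = expected_total
        self.processed = 0

    def record_inserted(self) -> bool:
        self.processed += 1
        return self.processed >= self.expected_total

tracker = BatchTracker(expected_total=3)
done = [tracker.record_inserted() for _ in range(3)]
print(done)  # → [False, False, True]
```

This removes the single-partition restriction entirely, since no consumer needs to see a specific "last" message.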

A more fundamental question is: why do you need to put the data into Kafka first? Why not read the file and do the insert directly?
