
Kafka: Bounded Batch Processing in Parallel

I would like to use Kafka to perform bounded batch processing, where the program will know when it is processing the last record.

Batch:

  • Read a flat file
  • Send each line as a message to Kafka

Kafka Listener:

  • Consume messages from Kafka
  • Insert each record into the database
  • If it is the last record, mark the batch job as done in the database.

One way is probably to use a single Kafka partition, assuming FIFO (First In, First Out) ordering is guaranteed, and have the batch program send an isLastRecord flag.

However, this restricts the processing to a single thread (a single consumer).
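The flag-based approach could be sketched as a small message envelope shared between producer and consumer. This is a hypothetical JSON layout (the field names "payload" and "isLastRecord" are assumptions, not an existing convention); the producer would send these messages in order to a single-partition topic.

```python
import json

def wrap_lines(lines):
    """Wrap each file line in a JSON envelope, flagging the final one.

    Hypothetical envelope: the field names ("payload", "isLastRecord")
    are an assumption that producer and consumer must agree on.
    """
    total = len(lines)
    return [
        json.dumps({"payload": line, "isLastRecord": i == total - 1})
        for i, line in enumerate(lines)
    ]

messages = wrap_lines(["row1", "row2", "row3"])
print(json.loads(messages[-1])["isLastRecord"])  # → True
```

The consumer then inserts each payload and, when it sees isLastRecord set, marks the batch done.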

Question

Is there any way to achieve this with parallel processing by leveraging multiple Kafka partitions?

If you need in-order guarantees per file, you are restricted to a single partition.

If you have multiple files, though, you could use different partitions for different files.
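One way to get this effect is to key every message by its source file name: Kafka's default partitioner sends equal keys to the same partition, so each file's lines stay in order while different files are consumed in parallel. A minimal sketch of that mapping, using CRC32 as a stand-in for the murmur2 hash Kafka's Java client actually uses (an illustrative assumption, not the real partitioner):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministic key -> partition mapping: the same key always lands
    on the same partition, so all lines of one file keep their order.
    CRC32 here stands in for Kafka's murmur2-based default partitioner."""
    return zlib.crc32(key) % num_partitions

# Every record keyed "orders.csv" goes to one partition; "users.csv"
# may land on a different one and be consumed by another thread.
p1 = partition_for(b"orders.csv", 6)
p2 = partition_for(b"orders.csv", 6)
print(p1 == p2)  # → True
```

With one consumer per partition in the same consumer group, each file is processed single-threaded (preserving order) while the files themselves are processed concurrently.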

If each line in the file is an insert into a database, I wonder whether you need an in-order guarantee in the first place, or whether you can insert all records/lines in any order?
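If order does not matter, completion can be detected by count rather than by a flagged last message: the producer records the expected total (the file's line count), and each consumer increments a processed counter after its insert, marking the batch done when the counter reaches the total. A toy sketch of the bookkeeping (a hypothetical class; in a real multi-consumer setup the counter would live in the database as an atomic UPDATE, not in-process state):

```python
class BatchTracker:
    """Tracks how many records of a bounded batch have been inserted.

    record_inserted() returns True once all expected records are in,
    regardless of which consumer or partition handled each one.
    Illustrative only: real consumers would increment a row in the
    database atomically instead of sharing this object.
    """

    def __init__(self, expected_total: int):
        self.expected_total = expected_total
        self.processed = 0

    def record_inserted(self) -> bool:
        self.processed += 1
        return self.processed >= self.expected_total

tracker = BatchTracker(expected_total=3)
done = [tracker.record_inserted() for _ in range(3)]
print(done)  # → [False, False, True]
```

This removes the single-partition restriction entirely, since no consumer needs to see a specific "last" message.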

A more fundamental question is: why do you need to put the data into Kafka first? Why not read the file and do the insert directly?
