Refactoring a Spring Batch Job to use Apache Kafka (decoupling readers and writers)
I currently have a Spring Batch job with a single step that reads data from Oracle, passes it through multiple Spring Batch processors (CompositeItemProcessor), and writes it to different destinations such as Oracle and files (CompositeItemWriter):
<batch:step id="dataTransformationJob">
    <batch:tasklet transaction-manager="transactionManager" task-executor="taskExecutor" throttle-limit="30">
        <batch:chunk reader="dataReader" processor="compositeDataProcessor" writer="compositeItemWriter" commit-interval="100"/>
    </batch:tasklet>
</batch:step>
In the above step, the compositeItemWriter is configured with two writers that run one after another and write 100 million records to Oracle as well as to a file. Also, the dataReader has a synchronized read method to ensure that multiple threads don't read the same data from Oracle. This job currently takes 1 hour 30 minutes to complete.
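As an aside, rather than hand-writing a synchronized read method, the same thread safety can be achieved declaratively with Spring Batch's SynchronizedItemStreamReader wrapper. This is only a sketch; the delegate bean name is an assumption, not from the original configuration:

```xml
<!-- Sketch (delegate bean name assumed): wrap the actual Oracle reader so
     that its read() calls are serialized across the task executor's threads. -->
<bean id="dataReader"
      class="org.springframework.batch.item.support.SynchronizedItemStreamReader">
    <property name="delegate" ref="oracleCursorItemReader"/>
</bean>
```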
I am planning to break the above job into two parts such that the reader/processors produce data on two Kafka topics (one for data to be written to Oracle and the other for data to be written to a file). On the other side, I will have a job with two parallel flows that read data from each topic and write it to Oracle and to a file respectively.
With the above architecture in mind, I wanted to understand how I can refactor a Spring Batch job to use Kafka. I believe the following areas are what I would need to address:
In the existing job, I have a commit-interval of 100. This means that the CompositeItemWriter will be called for every 100 records, and each writer will unpack the chunk and call its write method on it. Does this mean that when I write to Kafka, there will be 100 publish calls to Kafka?

Note: I am aware of Kafka Connect but don't want to use it, because it requires setting up a Connect cluster and I don't have the infrastructure available to support that.
Answers to your questions:

The premise behind "multiple rows into one single message in Kafka to avoid multiple network calls" is invalid, since multiple messages (rows) can be produced/consumed in a single network call. For your first draft, I would suggest keeping it simple by having a single row correspond to a single message.
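This is because the Kafka producer batches records transparently: individual send() calls are accumulated per partition and flushed to the broker in a single request, governed by producer settings such as the following (the values shown are illustrative, not recommendations):

```properties
# Illustrative producer settings: records are grouped into one network
# request per partition batch, so 100 send() calls need not mean 100 round trips.
batch.size=65536
linger.ms=20
compression.type=lz4
```

With a small linger.ms, the producer waits briefly to fill a batch before sending, trading a few milliseconds of latency for far fewer network calls.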