Refactoring a Spring Batch Job to use Apache Kafka (Decoupling readers and writers)

I currently have a Spring Batch job with a single step that reads data from Oracle, passes the data through multiple Spring Batch processors (CompositeItemProcessor), and writes the data to different destinations such as Oracle and files (CompositeItemWriter):

<batch:step id="dataTransformationJob">
    <batch:tasklet transaction-manager="transactionManager" task-executor="taskExecutor" throttle-limit="30">
        <batch:chunk reader="dataReader" processor="compositeDataProcessor" writer="compositeItemWriter" commit-interval="100"></batch:chunk>
    </batch:tasklet>
</batch:step>

In the above step, the compositeItemWriter is configured with two writers that run one after another and write 100 million records to Oracle as well as to a file. Also, the dataReader has a synchronized read method to ensure that multiple threads don't read the same data from Oracle. This job currently takes 1 hour 30 minutes to complete.
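
(For context, the synchronization is conceptually similar to wrapping the delegate reader in Spring Batch's SynchronizedItemStreamReader; the minimal sketch below uses illustrative names and a JdbcCursorItemReader delegate, not my exact code.)

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.context.annotation.Bean;

@Bean
public SynchronizedItemStreamReader<DataItem> dataReader(JdbcCursorItemReader<DataItem> delegate) {
    // Wrapping the delegate serializes read() calls so concurrent chunk-processing
    // threads never receive the same row twice. DataItem is a placeholder domain type.
    SynchronizedItemStreamReader<DataItem> reader = new SynchronizedItemStreamReader<>();
    reader.setDelegate(delegate);
    return reader;
}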

I am planning to break this job into two parts such that the reader/processors produce data onto two Kafka topics (one for data to be written to Oracle and the other for data to be written to a file). On the other side, I will have a job with two parallel flows that read data from each topic and write the data to Oracle and to a file respectively.
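
Roughly, what I have in mind for the producer side looks like the sketch below, assuming Spring Batch 4.2+ (which provides KafkaItemWriter) and Spring Kafka are on the classpath; the DataItem type, bean names, and the two default topics are placeholders, not existing code:

import java.util.Arrays;
import org.springframework.batch.item.kafka.KafkaItemWriter;
import org.springframework.batch.item.kafka.builder.KafkaItemWriterBuilder;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.batch.item.support.builder.CompositeItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;

@Configuration
public class KafkaProducerStepConfig {

    // Each KafkaTemplate is assumed to have its default topic set
    // ("oracle-data" and "file-data" respectively); KafkaItemWriter publishes
    // to the template's default topic. DataItem is a placeholder domain type.
    @Bean
    public CompositeItemWriter<DataItem> kafkaCompositeWriter(
            KafkaTemplate<Long, DataItem> oracleTopicTemplate,
            KafkaTemplate<Long, DataItem> fileTopicTemplate) {

        KafkaItemWriter<Long, DataItem> oracleTopicWriter = new KafkaItemWriterBuilder<Long, DataItem>()
                .kafkaTemplate(oracleTopicTemplate)
                .itemKeyMapper(DataItem::getId) // the key controls partition assignment
                .build();

        KafkaItemWriter<Long, DataItem> fileTopicWriter = new KafkaItemWriterBuilder<Long, DataItem>()
                .kafkaTemplate(fileTopicTemplate)
                .itemKeyMapper(DataItem::getId)
                .build();

        // Same CompositeItemWriter pattern as the existing job, but the delegates
        // now publish to Kafka instead of writing to Oracle and a file.
        return new CompositeItemWriterBuilder<DataItem>()
                .delegates(Arrays.asList(oracleTopicWriter, fileTopicWriter))
                .build();
    }
}

The chunk-oriented step itself (reader, processors, throttle-limit, commit-interval) would stay the same; only the writer delegates change.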

With the above architecture in mind, I wanted to understand how I can refactor a Spring Batch job to use Kafka. I believe the following are the areas I would need to address:

  1. In the existing job that doesn't use Kafka, my throttle-limit is 30; however, when I use Kafka in the middle, how does one decide the right throttle-limit?
  2. In the existing job I have a commit-interval of 100. This means that the CompositeItemWriter will be called once for every 100 records, and each delegate writer's write method will be called with that chunk. Does this mean that when I write to Kafka, there will be 100 publish calls to Kafka?
  3. Is there a way to club multiple rows into one single message in Kafka to avoid multiple network calls?
  4. On the consumer side, I want to have a Spring Batch multi-threaded step that is able to read each partition of a topic in parallel. Does Spring Batch already have inbuilt classes to support this?
  5. The consumer will use a standard JdbcBatchItemWriter or FlatFileItemWriter to write the data that was read from Kafka, so I believe this should just be standard Spring Batch in action.

Note: I am aware of Kafka Connect but don't want to use it because it requires setting up a Connect cluster and I don't have the infrastructure available to support that.

Answers to your questions:

  1. No throttling is needed in your Kafka producer; data should be available in Kafka for consumption as soon as possible. Your consumers could be throttled (if needed) depending on the implementation.
  2. The Kafka producer is configurable. 100 messages do not necessarily mean 100 network calls. You could write 100 messages to the Kafka producer (which may or may not buffer them, depending on the config) and then flush the buffer to force a network call. This would lead to (almost) the same behaviour as today (see the config sketch after this list).
  3. Multiple rows can be clubbed into a single message, since the payload of a Kafka message is entirely up to you. But the reasoning "multiple rows into one single message in Kafka to avoid multiple network calls" doesn't hold, because multiple messages (rows) can already be produced/consumed in a single network call. For your first draft, I would suggest keeping it simple by having a single row correspond to a single message.
  4. Not as far as I know (but I could be wrong on this one).
  5. Yes, I believe they should work just fine.
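
To make points 2 and 3 concrete, here is a sketch of the standard Kafka producer settings that control how buffered records are grouped into network requests; the property names are standard Kafka client configs, while the broker address, topic name, and values are illustrative only:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerBatchingSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // up to 64 KB of records per partition per request
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);         // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.ACKS_CONFIG, "all");           // durability matters for a 100M-record load

        try (KafkaProducer<Long, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: the client buffers records and coalesces them
            // into one request per broker, so 100 writes rarely mean 100 network calls.
            for (long id = 0; id < 100; id++) {
                producer.send(new ProducerRecord<>("oracle-data", id, "row-payload-" + id));
            }
            producer.flush(); // force any buffered records out, roughly analogous to a chunk commit
        }
    }
}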
