Spring Batch structure for parallel processing
I am seeking some guidance, please, on how to structure a Spring Batch application to ingest a bunch of potentially large delimited files, each with a different format.
The requirements are clear: we have one step which uses a MultiResourceItemReader to process the files in sequence. The files are input streams which time out.
Ideally I think we want to have
Thanks in advance.
This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all your token names.

A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass on to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
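A minimal sketch of that tokenizer/handler combination, assuming Spring Batch is on the classpath (the class name and factory method here are illustrative, not from the original answer):

```java
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;

// Hypothetical tokenizer that learns its column names from the skipped header line.
public class HeaderAwareTokenizer extends DelimitedLineTokenizer
        implements LineCallbackHandler {

    @Override
    public void handleLine(String headerLine) {
        // Invoked once with the skipped first line: reuse our own delimiter
        // settings to split the header, then use the result as the token
        // names for every subsequent data row.
        setNames(tokenize(headerLine).getValues());
    }

    // Wiring sketch: the same instance serves as both the skipped-lines
    // callback and the line tokenizer.
    public static FlatFileItemReader<FieldSet> reader() {
        HeaderAwareTokenizer tokenizer = new HeaderAwareTokenizer();

        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSet -> fieldSet); // pass the FieldSet straight through

        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setLinesToSkip(1);                  // skip the header row...
        reader.setSkippedLinesCallback(tokenizer); // ...but hand it to the tokenizer first
        reader.setLineMapper(lineMapper);
        return reader;
    }
}
```

With the names set from the header, the FieldSet arriving at your mapper/processor already carries the right name/value pairs, so the ItemProcessor can build JSON generically no matter which file format is being read.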
Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API); alternatively, the step that sends the JSON to the REST service can act as the writer, if the write is considered done once a response is received from the service.
Anyway, you don't need a separate step just to know the file name. Make it part of the application initialization code.
Strategies to parallelize your application are listed here.
You just said "a bunch of files". If the number of lines in those files is similar, I would go with the partitioning approach, i.e. by implementing the Partitioner interface I would hand each file over to a separate thread, and that thread would execute a step (reader -> processor -> writer). You wouldn't need a MultiResourceItemReader in this case, but a simple single-file reader, since each file will have its own reader.
Partitioning
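A configuration sketch of that partitioning setup, assuming a recent Spring Batch (5.x) builder API. Note that Spring Batch even ships a MultiResourcePartitioner that assigns one resource per partition, so you may not need to implement Partitioner yourself (the bean and step names here are illustrative):

```java
// One partition per input file; each partition runs the same single-file
// worker step (reader -> processor -> writer) on its own thread.
MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
partitioner.setResources(inputFiles); // Resource[], e.g. resolved from "file:input/*.csv"

Step masterStep = new StepBuilder("masterStep", jobRepository)
        .partitioner("workerStep", partitioner)
        .step(workerStep)                        // plain single-file FlatFileItemReader step
        .taskExecutor(new SimpleAsyncTaskExecutor("partition-"))
        .build();

// Each worker's reader picks its file out of the step execution context, e.g.:
// @Value("#{stepExecutionContext['fileName']}") Resource file
```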
If the line counts in those files vary a lot, i.e. if one file is going to take hours while another finishes in a few minutes, you can continue using a MultiResourceItemReader but use the Multi-threaded Step approach to achieve parallelism. This is chunk-level parallelism, so you might have to make the reader thread-safe.
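A sketch of that multi-threaded-step alternative (again assuming Spring Batch 5.x; the reader, processor, and writer names are illustrative). Wrapping the delegate in a SynchronizedItemStreamReader is one way to make the shared reader thread-safe:

```java
// Chunk-level parallelism: several threads each read/process/write whole
// chunks through one shared reader, so the reader must be synchronized.
SynchronizedItemStreamReader<FieldSet> syncReader = new SynchronizedItemStreamReader<>();
syncReader.setDelegate(multiResourceItemReader); // the existing MultiResourceItemReader

Step step = new StepBuilder("multiThreadedStep", jobRepository)
        .<FieldSet, String>chunk(100, transactionManager)
        .reader(syncReader)
        .processor(jsonProcessor)  // FieldSet -> JSON string
        .writer(restWriter)        // e.g. POST to the REST service
        .taskExecutor(new SimpleAsyncTaskExecutor("chunk-"))
        .build();
```

One caveat: restartability suffers with concurrent chunks, because Spring Batch can no longer reliably track the reader's position, so state saving is usually disabled on the delegate (setSaveState(false)).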
The Parallel Steps approach doesn't look suitable for your case, since your steps are not independent.
Hope it helps!