
Spring Batch structure for parallel processing

I am seeking some guidance, please, on how to structure a Spring Batch application to ingest a number of potentially large delimited files, each with a different format.

The requirements are clear:

  1. select the files to ingest from an external source: there can be multiple releases of some files each day, so the latest release must be picked
  2. turn each line of each file into JSON by combining the delimited fields with the column names from the first line (which is skipped)
  3. send each line of JSON to a RESTful API

We currently have one step which uses a MultiResourceItemReader and processes the files in sequence. The files are input streams, which time out.

Ideally, I think we want to have:

  1. a step which identifies the files to ingest
  2. a step which processes the files in parallel

Thanks in advance.

This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all your token names.
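A framework-free sketch of that idea (class and method names here are illustrative; in a real job the class would extend DelimitedLineTokenizer and implement LineCallbackHandler from Spring Batch, and `tokenize` would return a FieldSet):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for a tokenizer that extends DelimitedLineTokenizer
// and implements LineCallbackHandler: the skipped header line supplies the
// token names used for every subsequent line.
class HeaderAwareTokenizer {
    private final String delimiter;
    private String[] names = new String[0];

    HeaderAwareTokenizer(String delimiter) {
        this.delimiter = delimiter;
    }

    // Corresponds to LineCallbackHandler.handleLine(String): called with the
    // first (skipped) line so its column names become the token names.
    void handleLine(String headerLine) {
        this.names = headerLine.split(delimiter, -1);
    }

    // Corresponds to tokenizing a data line into name/value pairs
    // (what a FieldSet would carry in Spring Batch).
    Map<String, String> tokenize(String line) {
        String[] values = line.split(delimiter, -1);
        Map<String, String> fieldSet = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < values.length; i++) {
            fieldSet.put(names[i], values[i]);
        }
        return fieldSet;
    }
}
```

Because each file carries its own header, one tokenizer instance per file (or per reader) keeps the formats from interfering with each other.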

A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass through to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
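A minimal sketch of that processor logic, assuming the mapper hands over a plain name/value map; a real implementation would more likely use a JSON library such as Jackson, and this hand-rolled version only escapes quotes and backslashes:

```java
import java.util.Map;

// Illustrative ItemProcessor logic: turn the name/value pairs delivered by
// the FieldSetMapper into a one-line JSON object string.
class JsonLineProcessor {
    String process(Map<String, String> fields) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) json.append(",");
            json.append("\"").append(escape(e.getKey())).append("\":\"")
                .append(escape(e.getValue())).append("\"");
            first = false;
        }
        return json.append("}").toString();
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```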

Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API); alternatively, the call that sends the JSON to the REST service can act as the writer, if the write is considered complete once a response is received from the service.

Anyway, you don't need a separate step just to determine the file names. Make that part of the application initialization code.

Strategies to parallelize your application are listed here.

You just said a bunch of files. If the number of lines in those files is similar, I would go with the partitioning approach (i.e. by implementing the Partitioner interface, I would hand each file over to a separate thread, and that thread would execute a reader -> processor -> writer step). You wouldn't need a MultiResourceItemReader in this case, just a simple single-file reader, since each file will have its own reader. See: Partitioning.
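A framework-free sketch of that partitioning idea (in Spring Batch the class would implement Partitioner and return a Map of step-execution-context objects; the class name and the `fileName` key are illustrative):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for Spring Batch's Partitioner: one partition per
// file, each carrying the file name in its (here simplified) execution
// context so the step's single-file reader knows which file to open.
class FilePartitioner {
    private final List<String> fileNames;

    FilePartitioner(List<String> fileNames) {
        this.fileNames = fileNames;
    }

    // Corresponds to Partitioner.partition(int gridSize); gridSize is
    // ignored here because we want exactly one partition per file.
    Map<String, Map<String, Object>> partition(int gridSize) {
        Map<String, Map<String, Object>> partitions = new HashMap<>();
        int i = 0;
        for (String file : fileNames) {
            Map<String, Object> context = new HashMap<>();
            context.put("fileName", file);
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}
```

Each worker step would then pull its assigned `fileName` out of the step execution context (e.g. via a step-scoped reader) and process only that file.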

If the line counts in those files vary a lot, i.e. one file is going to take hours while another finishes in a few minutes, you can continue using the MultiResourceItemReader but use the multi-threaded step approach to achieve parallelism. This is chunk-level parallelism, so you might have to make the reader thread-safe.
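The thread-safety point can be illustrated with a framework-free sketch: several worker threads pull items from one shared reader, so `read()` must be synchronized or two chunks could receive the same line. (In Spring Batch, wrapping the reader in a SynchronizedItemStreamReader serves the same purpose; the class below is only an illustration.)

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative shared reader for a multi-threaded step: without the
// synchronized keyword, concurrent chunks could interleave inside the
// iterator and lose or duplicate lines.
class SynchronizedReader {
    private final Iterator<String> lines;

    SynchronizedReader(List<String> lines) {
        this.lines = lines.iterator();
    }

    // Returns the next line, or null when the input is exhausted
    // (mirroring ItemReader.read()'s end-of-data convention).
    synchronized String read() {
        return lines.hasNext() ? lines.next() : null;
    }
}
```

Note that restartability suffers with a multi-threaded step, since chunk state can no longer be tied to a single position in the file.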

The Parallel Steps approach doesn't look suitable for your case, since your steps are not independent.

Hope it helps!
