
Spring Batch structure for parallel processing

I am seeking some guidance, please, on how to structure a Spring Batch application to ingest a bunch of potentially large delimited files, each with a different format.

The requirements are clear:

  1. select the files to ingest from an external source: there can be multiple releases of some files each day, so the latest release must be picked
  2. turn each line of each file into JSON by combining the delimited fields with the column names from the first line (which is skipped)
  3. send each line of JSON to a RESTful API

We currently have one step which uses a MultiResourceItemReader and processes the files in sequence. The files are input streams, which time out.

Ideally I think we want to have

  1. a step which identifies the files to ingest
  2. a step which processes files in parallel

Thanks in advance.

This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all your token names.
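A minimal sketch of that idea, assuming Spring Batch 4.x APIs and a comma delimiter (the class name and the factory method are mine, just for illustration):

```java
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.core.io.Resource;

// Tokenizer that learns its column names from the skipped header line.
public class HeaderSettingTokenizer extends DelimitedLineTokenizer
        implements LineCallbackHandler {

    @Override
    public void handleLine(String headerLine) {
        // FlatFileItemReader hands the skipped first line here;
        // use the header's column names as the token names.
        setNames(headerLine.split(","));
    }

    // Convenience factory showing how the reader is wired up.
    public static FlatFileItemReader<FieldSet> readerFor(Resource file) {
        HeaderSettingTokenizer tokenizer = new HeaderSettingTokenizer();

        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setResource(file);
        reader.setLinesToSkip(1);                  // skip the header line...
        reader.setSkippedLinesCallback(tokenizer); // ...but feed it to the tokenizer
        reader.setLineMapper(lineMapper);
        return reader;
    }
}
```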

A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
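In the sketch above I used PassThroughFieldSetMapper so the FieldSet reaches the processor unchanged. A processor along those lines might look like this, assuming Jackson is on the classpath (the class name is mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.file.transform.FieldSet;

// Turns each FieldSet (header-named values) into a JSON string.
public class FieldSetToJsonProcessor implements ItemProcessor<FieldSet, String> {

    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public String process(FieldSet fieldSet) throws Exception {
        Map<String, String> row = new LinkedHashMap<>();
        for (String name : fieldSet.getNames()) {
            row.put(name, fieldSet.readString(name));
        }
        return objectMapper.writeValueAsString(row);
    }
}
```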

Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API). Alternatively, the writer itself can send the JSON to the REST service, so that the write completes only once a response is received from the service.
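If you do make the REST call the writer, a minimal sketch using the Spring Batch 4.x ItemWriter signature (the endpoint URL is hypothetical):

```java
import java.util.List;

import org.springframework.batch.item.ItemWriter;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.client.RestTemplate;

// Posts each JSON line to the REST API; the chunk is "written"
// once every call has returned.
public class RestPostingItemWriter implements ItemWriter<String> {

    private final RestTemplate restTemplate = new RestTemplate();
    private final String endpoint; // e.g. "https://example.com/api/ingest" (hypothetical)

    public RestPostingItemWriter(String endpoint) {
        this.endpoint = endpoint;
    }

    @Override
    public void write(List<? extends String> jsonLines) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        for (String json : jsonLines) {
            restTemplate.postForEntity(endpoint, new HttpEntity<>(json, headers), String.class);
        }
    }
}
```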

Anyway, you don't need a separate step just to identify the file names. Make that part of your application initialization code.
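For example, a hypothetical helper that keeps only the latest release of each file, assuming releases share a base name and the newest one has the latest modification time (adjust the regex to your actual naming convention):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class FileSelector {

    // Among multiple releases of the same logical file, keep the newest.
    // Assumes a release suffix like "_20240101" in the name.
    public static List<File> latestReleases(File dir) {
        Map<String, File> latest = new HashMap<>();
        for (File f : Objects.requireNonNull(dir.listFiles(File::isFile))) {
            String baseName = f.getName().replaceAll("_\\d{8}", "");
            latest.merge(baseName, f,
                    (a, b) -> a.lastModified() >= b.lastModified() ? a : b);
        }
        return new ArrayList<>(latest.values());
    }
}
```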

Strategies to parallelize your application are listed in the Spring Batch reference documentation under Scaling and Parallel Processing.

You just said a bunch of files. If the line counts of those files are similar, I would go with the partitioning approach (i.e., by implementing the Partitioner interface, I would hand each file over to a separate thread, and that thread would execute a step: reader -> processor -> writer). You wouldn't need a MultiResourceItemReader in this case, but a simple single-file reader, since each file gets its own reader. See the Partitioning section of the docs.
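Spring Batch even ships a MultiResourcePartitioner that does exactly this: one partition per file, with the file stored in each partition's step execution context under the key fileName. A sketch with the 4.x builders (the bean names, the Resource[] of selected files, and the worker step are assumptions):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedIngestConfig {

    // Master step: one partition (and one worker thread) per input file.
    @Bean
    public Step masterStep(StepBuilderFactory steps, Step workerStep, Resource[] inputFiles) {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(inputFiles); // stores "fileName" in each partition's context
        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .step(workerStep)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }

    // Each worker's reader picks up its own file from the step execution context.
    @Bean
    @StepScope
    public FlatFileItemReader<FieldSet> workerReader(
            @Value("#{stepExecutionContext['fileName']}") Resource file) {
        return HeaderSettingTokenizer.readerFor(file); // the reader sketched earlier
    }
}
```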

If the line counts vary a lot, i.e. if one file is going to take hours while another finishes in a few minutes, you can keep using MultiResourceItemReader but take the Multi-threaded Step approach to achieve parallelism. This is chunk-level parallelism, so you may have to make the reader thread-safe.
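A sketch of that variant, again with the 4.x builders, wrapping the reader in SynchronizedItemStreamReader to make it thread-safe (chunk size and throttle limit are illustrative):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.MultiResourceItemReader;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public Step multiThreadedStep(StepBuilderFactory steps,
                              MultiResourceItemReader<FieldSet> multiReader,
                              ItemProcessor<FieldSet, String> processor,
                              ItemWriter<String> writer) {
    // Serialize access to the delegate so concurrent chunks can share it safely.
    SynchronizedItemStreamReader<FieldSet> syncReader = new SynchronizedItemStreamReader<>();
    syncReader.setDelegate(multiReader);

    return steps.get("multiThreadedStep")
            .<FieldSet, String>chunk(100)            // chunk size is illustrative
            .reader(syncReader)
            .processor(processor)
            .writer(writer)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .throttleLimit(8)                        // thread cap, also illustrative
            .build();
}
```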

The Parallel Steps approach doesn't look suitable for your case, since your steps are not independent.

Hope it helps!
