
Create n tasks and execute them in parallel in Spring Batch

I have a requirement to read CSV files from about 100 S3 folders. In a single execution, only some of those folders may contain files, e.g. 60 folders have files. I need to process those 60 files and publish their data to a Kafka topic. This job needs to be scheduled every 4 hours. The CSV data can be small, or huge, e.g. 6 GB. I have to develop this in Java and deploy it into AWS. I am thinking of using Spring Batch, with steps like below: 1. Traverse all 100 S3 folders and identify each folder that has files, e.g. 60 folders have files. 2. Create that many jobs/tasks, e.g. 60 jobs, and execute them in parallel.

Restriction: I should not use AWS EMR for this process.

Please suggest a good approach to handle this with the best performance and minimal data-processing failures.

Here is one possible approach for you to think about. (FYI, I have done file processing using Spring Batch and threading using the strategy I am outlining here, but that code belongs to my company and I cannot share it.) I would suggest you read these articles to understand how to scale up using Spring Batch.

First, the Spring Batch documentation: https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html

Next, a good post from Stack Overflow itself: Best Spring batch scaling strategy

After reading both and understanding all the different approaches, I would suggest you concentrate on partitioning: https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#partitioning

This is the technique I used as well. In your case, you can spawn one thread per file from the partitioner, as in the sketch below.
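To make the idea concrete, here is a minimal, illustrative sketch of a file-per-partition setup. It is not from my original code: the bean names, the "fileKey" context key, the fileKeys list, slaveStep and the pool size of 10 are placeholders, and it assumes the pre-Spring-Batch-5 StepBuilderFactory style.

@Bean
public Partitioner s3FilePartitioner(List<String> fileKeys) {
    // One partition (and hence one worker step execution) per S3 file key.
    return gridSize -> {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int index = 0;
        for (String key : fileKeys) {
            ExecutionContext context = new ExecutionContext();
            context.putString("fileKey", key); // the worker step reads this key
            partitions.put("partition-" + index++, context);
        }
        return partitions;
    };
}

@Bean
public Step masterStep(StepBuilderFactory stepBuilderFactory,
                       Partitioner s3FilePartitioner,
                       Step slaveStep) {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setCorePoolSize(10); // number of files processed concurrently
    taskExecutor.setMaxPoolSize(10);
    taskExecutor.initialize();

    return stepBuilderFactory.get("masterStep")
            .partitioner("slaveStep", s3FilePartitioner)
            .step(slaveStep)
            .taskExecutor(taskExecutor)
            .build();
}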

You may need to maintain state, i.e. whether a file has been assigned to a thread or not. 'Processing' and 'Completed Processing' could also be states in your code. This depends on your requirements. (I had a whole set of states maintained in a singleton, which all threads would update after picking up a file, after finishing processing a file, etc.) A minimal sketch of such a tracker is shown below.
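As an illustration only (the class, enum and method names below are invented for this sketch, not taken from my original code), such a tracker could be as simple as a thread-safe map inside an enum singleton:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative singleton that tracks the processing state of each file.
public enum FileStateTracker {
    INSTANCE;

    public enum State { ASSIGNED, PROCESSING, COMPLETED, FAILED }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    // Returns true only for the first thread that claims the file.
    public boolean tryClaim(String fileKey) {
        return states.putIfAbsent(fileKey, State.ASSIGNED) == null;
    }

    public void update(String fileKey, State state) {
        states.put(fileKey, state);
    }

    public State get(String fileKey) {
        return states.get(fileKey);
    }
}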

You also need to think about finishing each file before the 4-hour window is over. You may be able to keep the file as is, you may want to move it to a new location while processing, or you may rename it while processing. Again, it depends on your requirements, but you need to think about this scenario. (In my case, I renamed the file by adding a unique suffix made of a timestamp in milliseconds, so it could not be overwritten by a new file; see the sketch below.)
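Since your files live in S3, a "rename" is really a copy followed by a delete. Here is a hypothetical sketch using the AWS SDK v1 client (the method name, bucket, key and suffix are placeholders, not from my original code):

// Illustrative only: "rename" an S3 object by copying it to a timestamp-suffixed
// key and deleting the original, so a newly arriving file with the same name
// cannot overwrite the one currently being processed.
public String claimFile(AmazonS3 s3, String bucket, String key) {
    String claimedKey = key + "." + System.currentTimeMillis() + ".processing";
    s3.copyObject(bucket, key, bucket, claimedKey);
    s3.deleteObject(bucket, key);
    return claimedKey;
}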

Finally, a sample from a blog which processes 5 CSV files through a partitioner. You can start from this sample: https://www.baeldung.com/spring-batch-partitioner

And search for more samples to see if this is the approach you want to take. Good luck.

For your use case, if all files are of the same type (i.e. they can be processed one by one), then you can use the option below.

Using ResourceLoader, we can read files in S3 in an ItemReader just like any other resource. This helps read the S3 files in chunks instead of loading an entire file into memory.

With the dependencies injected for ResourceLoader and the AmazonS3 client, configure your reader as below:

Replace values for sourceBucket and sourceObjectPrefix as needed.

@Autowired
private ResourceLoader resourceLoader;

@Autowired
private AmazonS3 amazonS3Client;

// READER
@Bean(destroyMethod="")
@StepScope
public SynchronizedItemStreamReader<Employee> employeeDataReader() {
    SynchronizedItemStreamReader<Employee> synchronizedItemStreamReader = new SynchronizedItemStreamReader<>();
    List<Resource> resourceList = new ArrayList<>();
    String sourceBucket = yourBucketName;
    String sourceObjectPrefix = yourSourceObjectPrefix;
    log.info("sourceObjectPrefix::"+sourceObjectPrefix);
    ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
            .withBucketName(sourceBucket)
            .withPrefix(sourceObjectPrefix);
    ObjectListing sourceObjectsListing;
    do{
        sourceObjectsListing = amazonS3Client.listObjects(listObjectsRequest);
        for (S3ObjectSummary sourceFile : sourceObjectsListing.getObjectSummaries()){

            // Skip empty files and files whose extension is not "csv"
            if (sourceFile.getSize() <= 0
                    || !sourceFile.getKey().endsWith(".csv")) {
                continue;
            }
            log.info("Reading "+sourceFile.getKey());
            resourceList.add(resourceLoader.getResource("s3://".concat(sourceBucket).concat("/")
                    .concat(sourceFile.getKey())));
        }
        listObjectsRequest.setMarker(sourceObjectsListing.getNextMarker());
    }while(sourceObjectsListing.isTruncated());

    Resource[] resources = resourceList.toArray(new Resource[resourceList.size()]);
    MultiResourceItemReader<Employee> multiResourceItemReader = new MultiResourceItemReader<>();
    multiResourceItemReader.setName("employee-multiResource-Reader");
    multiResourceItemReader.setResources(resources);
    multiResourceItemReader.setDelegate(employeeFileItemReader());
    synchronizedItemStreamReader.setDelegate(multiResourceItemReader);
    return synchronizedItemStreamReader;
}

@Bean
@StepScope
public FlatFileItemReader<Employee> employeeFileItemReader() {
    FlatFileItemReader<Employee> reader = new FlatFileItemReader<>();
    reader.setLinesToSkip(1); // skip the CSV header line

    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
    tokenizer.setNames(Employee.fields()); // column names matching the Employee properties

    BeanWrapperFieldSetMapper<Employee> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Employee.class);

    DefaultLineMapper<Employee> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);
    reader.setLineMapper(lineMapper);

    return reader;
}

For each file/resource, the MultiResourceItemReader delegates to the FlatFileItemReader configured above.

For the item processor part, you can also scale using the AsyncItemProcessor/AsyncItemWriter approach as needed; a sketch is shown below.
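For reference, a minimal sketch of that setup using AsyncItemProcessor/AsyncItemWriter from spring-batch-integration could look like the following (employeeProcessor(), employeeWriter() and the pool size are assumed placeholders for your existing beans and tuning, not part of the configuration above):

@Bean
public AsyncItemProcessor<Employee, Employee> asyncEmployeeProcessor() {
    AsyncItemProcessor<Employee, Employee> asyncProcessor = new AsyncItemProcessor<>();
    asyncProcessor.setDelegate(employeeProcessor()); // your existing ItemProcessor bean
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setCorePoolSize(8);                 // tune for your workload
    taskExecutor.initialize();
    asyncProcessor.setTaskExecutor(taskExecutor);
    return asyncProcessor;
}

@Bean
public AsyncItemWriter<Employee> asyncEmployeeWriter() {
    AsyncItemWriter<Employee> asyncWriter = new AsyncItemWriter<>();
    asyncWriter.setDelegate(employeeWriter());       // your existing ItemWriter (e.g. the Kafka writer)
    return asyncWriter;
}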
