
Why is my Spring Batch multi-threaded step executing all reads before any processing?

I'm attempting to write a Spring Batch process for converting millions of entries in a legacy DB, with a sprawling schema, into a streamlined JSON format, and publishing that JSON to GCP PubSub. To make this process as efficient as possible, I'm attempting to leverage a Spring Batch multi-threaded Step.

To test my process, I've started small: a page size and chunk size of 5, a limit of 20 entries total to process, and a thread pool of just 1 thread. I'm attempting to step through the process to validate that it's working as I expected, but it's not.

I expected that configuring my RepositoryItemReader with a page size of 5 would cause it to read just 5 records from the DB, process those records in a single chunk of 5, and only then read the next 5. But that's not what's happening. Instead, since I have Hibernate's show-sql enabled, I can see in the logs that the reader reads ALL 20 records before any processing starts.
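
To illustrate, here's roughly the interleaving I expected, written out as a simplified sketch of the chunk-oriented model (my own sketch, not Spring Batch's actual implementation):

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ExpectedChunkLoop {

    // Simplified sketch of the chunk-oriented loop I expected: reads,
    // processing, and writes interleaved one chunk at a time, so at most
    // one page of items is held in memory per thread.
    static <I, O> void run(ItemReader<I> reader, ItemProcessor<I, O> processor,
                           ItemWriter<O> writer, int chunkSize) throws Exception {
        List<I> chunk = new ArrayList<>();
        for (I item = reader.read(); item != null; item = reader.read()) {
            chunk.add(item);
            if (chunk.size() == chunkSize) {
                processAndWrite(chunk, processor, writer);
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            processAndWrite(chunk, processor, writer); // final partial chunk
        }
    }

    private static <I, O> void processAndWrite(List<I> chunk,
                                               ItemProcessor<I, O> processor,
                                               ItemWriter<O> writer) throws Exception {
        List<O> out = new ArrayList<>();
        for (I item : chunk) {
            O processed = processor.process(item);
            if (processed != null) { // a null result means the item was filtered out
                out.add(processed);
            }
        }
        writer.write(out);
    }
}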

Why is my multi-threaded step performing ALL of its reading before executing any processing? Have I misconfigured it? Obviously I wouldn't want my Job trying to load millions of DTOs into memory before it starts processing anything...

Here's how I've configured my job:

@Configuration
public class ConversionBatchJobConfig {

    @Bean
    public SimpleCompletionPolicy processChunkSize(@Value("${commit.chunk.size:5}") Integer chunkSize) {
        return new SimpleCompletionPolicy(chunkSize);
    }

    @Bean
    @StepScope
    public ItemStreamReader<DbProjection> dbReader(
            MyDomainRepository myDomainRepository,
            @Value("#{jobParameters[pageSize]}") Integer pageSize, //pageSize and chunkSize both 5 for now
            @Value("#{jobParameters[limit]}") Integer limit) { //limit is 40
        RepositoryItemReader<DbProjection> myDomainRepositoryReader = new RepositoryItemReader<>();
        myDomainRepositoryReader.setRepository(myDomainRepository);
        myDomainRepositoryReader.setMethodName("findActiveDbDomains"); //A native query
        myDomainRepositoryReader.setArguments(new ArrayList<Object>() {{
            add("ACTIVE");
        }});
        myDomainRepositoryReader.setSort(new HashMap<String, Sort.Direction>() {{
            put("update_date", Sort.Direction.ASC);
        }});
        myDomainRepositoryReader.setPageSize(pageSize);
        myDomainRepositoryReader.setMaxItemCount(limit);
        myDomainRepositoryReader.setSaveState(false);
        return myDomainRepositoryReader;
    }

    @Bean
    @StepScope
    public ItemProcessor<DbProjection, JsonMessage> dataConverter(DataRetrievalService dataRetrievalService) {
        return new DbProjectionToJsonMessageConverter(dataRetrievalService);
    }

    @Bean
    @StepScope
    public ItemWriter<JsonMessage> jsonPublisher(GcpPubsubPublisherService publisherService) {
        return new JsonMessageWriter(publisherService);
    }

    @Bean
    public Step conversionProcess(SimpleCompletionPolicy processChunkSize,
                                  ItemStreamReader<DbProjection> dbReader,
                                  ItemProcessor<DbProjection, JsonMessage> dataConverter,
                                  ItemWriter<JsonMessage> jsonPublisher,
                                  StepBuilderFactory stepBuilderFactory,
                                  TaskExecutor conversionThreadPool,
                                  @Value("${conversion.failure.limit:20}") int maximumFailures) {
        return stepBuilderFactory.get("conversionProcess")
                .<DbProjection, JsonMessage>chunk(processChunkSize)
                .reader(dbReader)
                .processor(dataConverter)
                .writer(jsonPublisher)
                .faultTolerant()
                .skipPolicy(new MyCustomConversionSkipPolicy(maximumFailures))
                //  ^ for now this returns true for everything until 20 failures
                .listener(new MyConversionSkipListener(processStatus))
                //  ^ for now this just logs the error
                .taskExecutor(conversionThreadPool)
                .build();
    }

    @Bean
    public Job conversionJob(Step conversionProcess,
                             JobBuilderFactory jobBuilderFactory) {
        return jobBuilderFactory.get("conversionJob")
                .start(conversionProcess)
                .build();
    }
}

You need to check the value of hibernate.jdbc.fetch_size and set it accordingly.

The pageSize and fetchSize are different parameters. You can find more details on the difference here: https://stackoverflow.com/a/58058009/5019386. So in your case, if the fetchSize is bigger than the pageSize, it's possible that more records are fetched than the page size.
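
For example (an assumed Spring Data JPA setup, since the repository itself isn't shown in the question), the fetch size can be capped on the query the reader invokes with a JPA query hint; the table and column names below are placeholders:

import javax.persistence.QueryHint;

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.jpa.repository.QueryHints;
import org.springframework.data.repository.PagingAndSortingRepository;

public interface MyDomainRepository extends PagingAndSortingRepository<MyDomain, Long> {

    // Hint Hibernate to fetch no more than 5 rows per JDBC round trip,
    // matching the reader's page size.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "5"))
    @Query(value = "SELECT * FROM my_domain WHERE status = ?1",
           countQuery = "SELECT count(*) FROM my_domain WHERE status = ?1",
           nativeQuery = true)
    Page<DbProjection> findActiveDbDomains(String status, Pageable pageable);
}

Alternatively, in a Spring Boot application the property can be set globally with spring.jpa.properties.hibernate.jdbc.fetch_size=5, which passes hibernate.jdbc.fetch_size through to Hibernate for every query.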
