
What is the difference between a Multi-threaded Step and Local Partitioning in Spring Batch?

I have the following doc.

It mentions the following:

1.1. Multi-threaded Step

The simplest way to start parallel processing is to add a TaskExecutor to your Step configuration.

When using Java configuration, a TaskExecutor can be added to the step as shown in the following example:

@Bean
public TaskExecutor taskExecutor(){
    return new SimpleAsyncTaskExecutor("spring_batch");
}

@Bean
public Step sampleStep(TaskExecutor taskExecutor) {
        return this.stepBuilderFactory.get("sampleStep")
                                .<String, String>chunk(10)
                                .reader(itemReader())
                                .writer(itemWriter())
                                .taskExecutor(taskExecutor)
                                .build();
}

The result of the above configuration is that the Step executes by reading, processing, and writing each chunk of items (each commit interval) in a separate thread of execution. Note that this means there is no fixed order for the items to be processed, and a chunk might contain items that are non-consecutive compared to the single-threaded case. In addition to any limits placed by the task executor (such as whether it is backed by a thread pool), there is a throttle limit in the tasklet configuration which defaults to 4. You may need to increase this to ensure that a thread pool is fully utilized.
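The behaviour described in the quoted paragraph can be modelled without Spring Batch at all. The following is a minimal plain-JDK sketch, not the framework's implementation: the class name `MultiThreadedChunkSketch` and its methods are hypothetical, and the "reader" is just a shared iterator whose reads are synchronized. Each worker thread assembles its own chunk item by item, so every item is handled exactly once, but which items land in which chunk depends on thread scheduling.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Plain-JDK model of a multi-threaded chunk step (illustrative only, not Spring Batch code). */
public class MultiThreadedChunkSketch {

    /** Reads one item from the shared source; synchronized because all threads share it. */
    public static Integer readItem(Iterator<Integer> source) {
        synchronized (source) {
            return source.hasNext() ? source.next() : null;
        }
    }

    /** Processes itemCount items in chunks of chunkSize on the given number of threads. */
    public static List<List<Integer>> run(int itemCount, int chunkSize, int threads) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < itemCount; i++) {
            items.add(i);
        }
        Iterator<Integer> source = items.iterator();
        ConcurrentLinkedQueue<List<Integer>> written = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                while (true) {
                    // Each thread builds its own chunk, one synchronized read at a time.
                    List<Integer> chunk = new ArrayList<>();
                    Integer item;
                    while (chunk.size() < chunkSize && (item = readItem(source)) != null) {
                        chunk.add(item);
                    }
                    if (chunk.isEmpty()) {
                        return; // source exhausted
                    }
                    written.add(chunk); // stands in for "write the chunk and commit"
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(written);
    }

    public static void main(String[] args) {
        // Chunk contents vary from run to run; only the total is stable.
        for (List<Integer> chunk : run(25, 5, 4)) {
            System.out.println(chunk);
        }
    }
}
```

Running `main` a few times should show chunks whose contents are not always consecutive ranges, which is the effect the documentation warns about.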

Before reading this, I thought that parallelism had to be achieved through local partitioning, where I provide a partitioner that says how to divide the data into pieces. A multi-threaded Step appears to do this automatically.

Question

Could you explain how it works? How can I control it, apart from the number of threads? Will it work for a flat file?

P.S.

I created this example:

@Configuration
public class MultithreadedStepConfig {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;
    @Autowired
    private ToLowerCasePersonProcessor toLowerCasePersonProcessor;

    @Autowired
    private DbPersonWriter dbPersonWriter;

    @Value("${app.single-file}")
    Resource resources;

    @Bean
    public Job job() {
        return jobBuilderFactory.get("myMultiThreadedJob")
                .incrementer(new RunIdIncrementer())
                .flow(csvToDataBaseSlaveStep())
                .end()
                .build();
    }

    private Step csvToDataBaseSlaveStep() {
        return stepBuilderFactory.get("csvToDatabaseStep")
                .<Person, Person>chunk(50)
                .reader(csvPersonReaderMulti())
                .processor(toLowerCasePersonProcessor)
                .writer(dbPersonWriter)
                .taskExecutor(jobTaskExecutorMultiThreaded())
                .build();

    }

    @Bean
    @StepScope
    public FlatFileItemReader<Person> csvPersonReaderMulti() {
        return new FlatFileItemReaderBuilder<Person>()
                .name("csvPersonReaderSplitted")
                .resource(resources)
                .delimited()
                .names(new String[]{"firstName", "lastName"})
                .fieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {{
                    setTargetType(Person.class);
                }})
                .saveState(false)
                .build();

    }

    @Bean
    public TaskExecutor jobTaskExecutorMultiThreaded() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        // core pool of 25 threads, maximum of 30
        taskExecutor.setMaxPoolSize(30);
        taskExecutor.setCorePoolSize(25);
        taskExecutor.setThreadGroupName("multi-");
        taskExecutor.setThreadNamePrefix("multi-");
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }
}

According to the log it really works, but I want to know the details. Is it better than a self-written partitioner?

There are fundamental differences between multi-threaded steps and partitioning.

A multi-threaded step runs in a single process, so it is not a good idea to use it if your processor or writer keeps persisted state. However, if you are just generating a report without saving anything, it is a good choice.
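For the flat-file case in the question, the stateful component is the reader itself: `FlatFileItemReader` is documented as not thread-safe. One common approach is to wrap it in Spring Batch's `SynchronizedItemStreamReader`. The sketch below assumes Spring Batch 4.x on the classpath; the helper class `SynchronizedReaders` is hypothetical and exists only for illustration.

```java
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;

/** Hypothetical helper: wraps a flat-file reader so that concurrent read() calls are serialized. */
public final class SynchronizedReaders {

    private SynchronizedReaders() {
    }

    public static <T> SynchronizedItemStreamReader<T> synchronize(FlatFileItemReader<T> delegate) {
        SynchronizedItemStreamReader<T> reader = new SynchronizedItemStreamReader<>();
        // open(), update() and close() are forwarded to the delegate by the wrapper.
        reader.setDelegate(delegate);
        return reader;
    }
}
```

In the question's step this would be used as `.reader(SynchronizedReaders.synchronize(csvPersonReaderMulti()))`. Note also that, per the documentation quoted in the question, the throttle limit defaults to 4, so a pool of 25 threads is probably not fully used unless something like `.throttleLimit(25)` follows `.taskExecutor(...)` on the step builder (this method is available in Spring Batch 4.x).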

As you mentioned, if you want to process a flat file and store the records in a database, you can use remote chunking, assuming your reader is not heavy.

A partitioner creates a separate worker step execution (a separate thread when run locally, or a separate process when run remotely) for each set of data that you can divide logically.
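To make "divide logically" concrete, here is a minimal plain-JDK sketch of range partitioning. It only mirrors the shape of Spring Batch's `Partitioner.partition(int gridSize)`, which returns one named context per partition; the class `RangePartitionSketch` and the `int[]` range representation are assumptions for illustration, not framework types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-JDK sketch of range partitioning (illustrative only, not a Spring Batch Partitioner). */
public class RangePartitionSketch {

    /**
     * Splits [0, totalItems) into at most gridSize contiguous ranges.
     * Each value is {fromInclusive, toExclusive}; sizes differ by at most one.
     */
    public static Map<String, int[]> partition(int totalItems, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        Map<String, int[]> partitions = new LinkedHashMap<>();
        int base = totalItems / gridSize;
        int remainder = totalItems % gridSize;
        int from = 0;
        for (int i = 0; i < gridSize && from < totalItems; i++) {
            // The first 'remainder' partitions take one extra item.
            int size = base + (i < remainder ? 1 : 0);
            partitions.put("partition" + i, new int[] {from, from + size});
            from += size;
        }
        return partitions;
    }

    public static void main(String[] args) {
        partition(10, 3).forEach((name, range) ->
                System.out.println(name + " -> [" + range[0] + ", " + range[1] + ")"));
    }
}
```

For 10 items and a grid size of 3 this yields the ranges [0, 4), [4, 7) and [7, 10). Unlike a multi-threaded step, each worker then owns a fixed, known slice of the input.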

Hope this helps.

Based on my understanding, partitioning is typically used for remote processing. The partition master (or manager) step creates multiple identical workers; the number of workers is the given grid size. For local processing, those workers are identical, meaning they use the same reader and writer definitions, but they execute on different threads with different chunks of input provided by the custom partitioner. However, if the worker step has listeners with before/after step methods, those methods are called by each worker, whereas in a multi-threaded step they are called only once. Other than that, I don't see any differences for local processing.
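A manager step for local partitioning might be wired roughly as follows. This is a sketch assuming Spring Batch 4.x; `LocalPartitioningSketchConfig`, `indexPartitioner`, the context keys `partitionIndex` and `partitionCount`, and the `workerStep` and `taskExecutor` beans (assumed to be defined elsewhere) are all hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class LocalPartitioningSketchConfig {

    // Hypothetical partitioner: gives each worker an index it can use to pick its slice of input.
    @Bean
    public Partitioner indexPartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> contexts = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putInt("partitionIndex", i);
                context.putInt("partitionCount", gridSize);
                contexts.put("partition" + i, context);
            }
            return contexts;
        };
    }

    // The manager creates one execution of workerStep per partition and runs them on the executor.
    @Bean
    public Step managerStep(StepBuilderFactory steps, Partitioner indexPartitioner,
                            Step workerStep, TaskExecutor taskExecutor) {
        return steps.get("managerStep")
                .partitioner("workerStep", indexPartitioner)
                .step(workerStep)
                .gridSize(4)
                .taskExecutor(taskExecutor)
                .build();
    }
}
```

A step-scoped reader in the worker step can then pick up its slice with an expression such as `#{stepExecutionContext['partitionIndex']}`. Because each partition runs as its own step execution, step listeners on the worker fire once per partition, which matches the difference described above.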

Personally, I suggest not using partitioning for local processing; use a multi-threaded step instead. There are many open source packages, so don't use one if you don't feel comfortable with it.


 