Configuring gridSize in Spring Batch partitioning
In Spring Batch partitioning, the relationship between the gridSize of the PartitionHandler and the number of ExecutionContexts returned by the Partitioner is a little confusing. For example, MultiResourcePartitioner states that it ignores gridSize, but the Partitioner documentation doesn't explain when/why this is acceptable to do.
For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?
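For concreteness, the setup described above might look roughly like the following Java configuration sketch. This is an assumption on my part, not code from the question; the workerStep bean and the resource pattern are hypothetical placeholders, and the API shown is Spring Batch's Java DSL:

```java
// Sketch only: workerStep and the file pattern are hypothetical placeholders.
@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(20);   // the shared pool of 20 threads
    executor.setMaxPoolSize(20);
    return executor;
}

@Bean
public Partitioner partitioner() throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    // One partition per matching file; this implementation ignores gridSize.
    partitioner.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/data/input/*.csv"));
    return partitioner;
}

@Bean
public Step partitionedStep(StepBuilderFactory steps, Step workerStep) throws IOException {
    return steps.get("partitionedStep")
            .partitioner("workerStep", partitioner())
            .step(workerStep)
            .gridSize(5)                    // passed to the Partitioner
            .taskExecutor(taskExecutor())   // shared 20-thread pool
            .build();
}
```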
Let's say the MultiResourcePartitioner returns 10 partitions for a particular run. Does this mean that only 5 of them will execute at a time until all 10 have completed, and that no more than 5 of the 20 threads will be used for this step?
If this is the case, when/why is it okay to ignore the gridSize parameter when overriding Partitioner with a custom implementation? I think it would help if this were described in the documentation.
If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?
There are a few good questions here, so let's walk through them individually:
For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?
The TaskExecutorPartitionHandler defers the concurrency limitations to the TaskExecutor you provide. Because of this, in your example, the PartitionHandler will use up to all 20 threads, as the TaskExecutor allows.
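This behavior can be illustrated without Spring at all. The sketch below (plain java.util.concurrent, a simulation rather than actual Spring Batch code) runs 10 simulated partitions on a 20-thread pool and records the peak concurrency: all 10 run at once, because the pool size, not a nominal gridSize of 5, is the cap.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionConcurrencyDemo {

    // Runs `partitions` dummy tasks on a pool of `poolSize` threads and returns
    // the highest number of tasks observed running at the same time.
    static int maxConcurrent(int poolSize, int partitions) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch allStarted = new CountDownLatch(partitions);
        CountDownLatch release = new CountDownLatch(1);

        for (int i = 0; i < partitions; i++) {
            pool.submit(() -> {
                peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                allStarted.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
                running.decrementAndGet();
            });
        }
        allStarted.await();   // every task is now running simultaneously...
        release.countDown();  // ...so let them all finish
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 20 threads, 10 partitions: all 10 run concurrently, regardless of gridSize.
        System.out.println("max concurrent partitions: " + maxConcurrent(20, 10));
    }
}
```

If the pool had only 5 threads, the same call would report a peak of 5, which is exactly the lever discussed at the end of this answer.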
If this is the case, when/why is it okay to ignore the 'gridSize' parameter when overriding Partitioner with a custom implementation? I think it would help if this was described in the documentation.
When we look at a partitioned step, there are two components of concern: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data to be divided up and how best to do so. The PartitionHandler is responsible for delegating that work out to slaves for execution. In order for the PartitionHandler to do its delegation, it needs to understand the "fabric" that it's working with (local threads, remote slave processes, etc.).
When dividing up the data to be worked on (via the Partitioner), it can be useful to know how many workers are available. However, that metric isn't always very useful for the data you're working with. For example, when dividing database rows, it makes sense to divide them evenly by the number of workers available. However, in most scenarios it's impractical to combine or divide files, so it's just easier to create a partition per file. Whether gridSize is useful or not depends on the data you're trying to divide up.
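As an illustration of the file-per-partition case, the helper below only mimics what MultiResourcePartitioner does; it is plain Java, not the actual Spring Batch class. It accepts gridSize to match the Partitioner contract, then deliberately ignores it, because the number of files dictates the number of partitions:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilePartitioner {

    // Mimics MultiResourcePartitioner: one ExecutionContext-like map per file.
    // gridSize is accepted to satisfy the Partitioner contract but deliberately
    // ignored -- the file count, not the grid size, drives partitioning.
    static Map<String, Map<String, String>> partition(List<String> files, int gridSize) {
        Map<String, Map<String, String>> partitions = new LinkedHashMap<>();
        int i = 0;
        for (String file : files) {
            Map<String, String> context = new HashMap<>();
            context.put("fileName", file);   // each worker step reads exactly one file
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // gridSize of 5, but only 3 files: we get 3 partitions, not 5.
        Map<String, Map<String, String>> result =
                partition(List.of("a.csv", "b.csv", "c.csv"), 5);
        System.out.println(result.size() + " partitions: " + result.keySet());
    }
}
```

A row-based partitioner, by contrast, would use gridSize to compute evenly sized key ranges, which is why the contract passes it in even though some implementations ignore it.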
If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?
If you're re-using a TaskExecutor, you may not be able to, since that TaskExecutor may be doing other things. I wonder why you'd re-use one, given the relatively low overhead of creating a dedicated one (you can even make it step-scoped so it's only created when the partitioned step is running).
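A dedicated executor gives you both knobs independently: its pool size caps how many partitions run in parallel for this step, while the Partitioner alone decides how many partitions exist. A minimal sketch, assuming Java configuration (the bean name is hypothetical):

```java
// Sketch only: a dedicated, step-scoped executor whose pool size (5) caps how
// many partitions run in parallel, independent of how many partitions the
// Partitioner creates. The bean name is hypothetical.
@Bean
@StepScope
public TaskExecutor partitionStepExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(5);   // at most 5 partitions in flight at once
    executor.setMaxPoolSize(5);
    executor.setQueueCapacity(Integer.MAX_VALUE); // remaining partitions wait in the queue
    return executor;
}
```

With 10 partitions and this 5-thread executor, 5 run at a time and the other 5 queue until a thread frees up, which is the behavior the question originally hoped gridSize would provide.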