Configuring gridSize in Spring Batch partitioning
In Spring Batch partitioning, the relationship between the gridSize of the PartitionHandler and the number of ExecutionContexts returned by the Partitioner is a little confusing. For example, MultiResourcePartitioner states that it ignores gridSize, but the Partitioner documentation doesn't explain when/why this is acceptable to do.
For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?
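For concreteness, the setup described above might look roughly like the following Java configuration sketch. This is an assumption on my part, not code from the question; the workerStep bean and the resource pattern are hypothetical placeholders, and the API shown is Spring Batch's Java DSL:

```java
// Sketch only: workerStep and the file pattern are hypothetical placeholders.
@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(20);   // the shared pool of 20 threads
    executor.setMaxPoolSize(20);
    return executor;
}

@Bean
public Partitioner partitioner() throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    // One partition per matching file; this implementation ignores gridSize.
    partitioner.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/data/input/*.csv"));
    return partitioner;
}

@Bean
public Step partitionedStep(StepBuilderFactory steps, Step workerStep) throws IOException {
    return steps.get("partitionedStep")
            .partitioner("workerStep", partitioner())
            .step(workerStep)
            .gridSize(5)                    // passed to the Partitioner
            .taskExecutor(taskExecutor())   // shared 20-thread pool
            .build();
}
```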
Let's say the MultiResourcePartitioner returns 10 partitions for a particular run. Does this mean that only 5 of them will execute at a time until all 10 have completed, and that no more than 5 of the 20 threads will be used for this step?
If this is the case, when/why is it okay to ignore the gridSize parameter when overriding Partitioner with a custom implementation? I think it would help if this were described in the documentation.
If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?
There are a few good questions here, so let's walk through them individually:
For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?
The TaskExecutorPartitionHandler defers the concurrency limitations to the TaskExecutor you provide. Because of this, in your example, the PartitionHandler will use up to all 20 threads, as the TaskExecutor allows.
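This behavior can be illustrated without Spring at all. The sketch below (plain java.util.concurrent, a simulation rather than actual Spring Batch code) runs 10 simulated partitions on a 20-thread pool and records the peak concurrency: all 10 run at once, because the pool size, not a nominal gridSize of 5, is the cap.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionConcurrencyDemo {

    // Runs `partitions` dummy tasks on a pool of `poolSize` threads and returns
    // the highest number of tasks observed running at the same time.
    static int maxConcurrent(int poolSize, int partitions) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch allStarted = new CountDownLatch(partitions);
        CountDownLatch release = new CountDownLatch(1);

        for (int i = 0; i < partitions; i++) {
            pool.submit(() -> {
                peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                allStarted.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
                running.decrementAndGet();
            });
        }
        allStarted.await();   // every task is now running simultaneously...
        release.countDown();  // ...so let them all finish
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 20 threads, 10 partitions: all 10 run concurrently, regardless of gridSize.
        System.out.println("max concurrent partitions: " + maxConcurrent(20, 10));
    }
}
```

If the pool had only 5 threads, the same call would report a peak of 5, which is exactly the lever discussed at the end of this answer.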
If this is the case, when/why is it okay to ignore the 'gridSize' parameter when overriding Partitioner with a custom implementation? I think it would help if this was described in the documentation.
When we look at a partitioned step, there are two components of concern: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data to be divided up and how best to do so. The PartitionHandler is responsible for delegating that work out to slaves for execution. In order for the PartitionHandler to do its delegation, it needs to understand the "fabric" that it's working with (local threads, remote slave processes, etc.).
When dividing up the data to be worked on (via the Partitioner), it can be useful to know how many workers are available. However, that metric isn't always very useful for the data you're working with. For example, when dividing database rows, it makes sense to divide them evenly by the number of workers available. However, in most scenarios it's impractical to combine or divide files, so it's just easier to create a partition per file. Whether gridSize is useful or not depends on the data you're trying to divide up.
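As an illustration of the file-per-partition case, the helper below only mimics what MultiResourcePartitioner does; it is plain Java, not the actual Spring Batch class. It accepts gridSize to match the Partitioner contract, then deliberately ignores it, because the number of files dictates the number of partitions:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilePartitioner {

    // Mimics MultiResourcePartitioner: one ExecutionContext-like map per file.
    // gridSize is accepted to satisfy the Partitioner contract but deliberately
    // ignored -- the file count, not the grid size, drives partitioning.
    static Map<String, Map<String, String>> partition(List<String> files, int gridSize) {
        Map<String, Map<String, String>> partitions = new LinkedHashMap<>();
        int i = 0;
        for (String file : files) {
            Map<String, String> context = new HashMap<>();
            context.put("fileName", file);   // each worker step reads exactly one file
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // gridSize of 5, but only 3 files: we get 3 partitions, not 5.
        Map<String, Map<String, String>> result =
                partition(List.of("a.csv", "b.csv", "c.csv"), 5);
        System.out.println(result.size() + " partitions: " + result.keySet());
    }
}
```

A row-based partitioner, by contrast, would use gridSize to compute evenly sized key ranges, which is why the contract passes it in even though some implementations ignore it.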
If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?
If you're re-using a TaskExecutor, you may not be able to, since that TaskExecutor may be doing other things. I wonder why you'd re-use one, given the relatively low overhead of creating a dedicated one (you can even make it step-scoped so it's only created when the partitioned step is running).
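A dedicated executor gives you both knobs independently: its pool size caps how many partitions run in parallel for this step, while the Partitioner alone decides how many partitions exist. A minimal sketch, assuming Java configuration (the bean name is hypothetical):

```java
// Sketch only: a dedicated, step-scoped executor whose pool size (5) caps how
// many partitions run in parallel, independent of how many partitions the
// Partitioner creates. The bean name is hypothetical.
@Bean
@StepScope
public TaskExecutor partitionStepExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(5);   // at most 5 partitions in flight at once
    executor.setMaxPoolSize(5);
    executor.setQueueCapacity(Integer.MAX_VALUE); // remaining partitions wait in the queue
    return executor;
}
```

With 10 partitions and this 5-thread executor, 5 run at a time and the other 5 queue until a thread frees up, which is the behavior the question originally hoped gridSize would provide.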