CUDA: an algorithm for partitioning an array into blocks based on the sum of the elements

Question

Here is the problem: I have an array of random numbers. I am supposed to partition it into blocks such that the sum of the elements in each block is no greater than a given N. I thought I could solve this with a simple scan, but there is a nasty bug that I don't know how to fix efficiently.

For example, with N=8 and the array:

{2, 3, 1, 4, 4, 1, 6}

Performing an inclusive sum scan gives:

{2, 5, 6, 10, 14, 15, 21}

A simple integer division by N=8 then yields the following partition:

{0, 0, 0, 1, 1, 1, 2}

Then I realized there is a bug: the sum of the elements in the second block is 4+4+1=9, which exceeds 8. Integer division implicitly assumes that the first block sums to exactly 8, but here it sums to only 2+3+1=6.
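The flawed approach can be reproduced with a minimal serial C++ sketch. The function name `naive_labels` is hypothetical, and `std::partial_sum` merely stands in for the parallel scan that would run on the GPU.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Naive labelling from the question: inclusive scan, then integer division by N.
// Hypothetical helper; std::partial_sum stands in for a parallel scan primitive.
std::vector<int> naive_labels(const std::vector<int>& a, int N) {
    std::vector<int> scan(a.size());
    std::partial_sum(a.begin(), a.end(), scan.begin());  // inclusive sum scan
    std::vector<int> label(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Cuts wherever the running sum crosses a multiple of N,
        // regardless of how full the previous block actually was.
        label[i] = scan[i] / N;
    }
    return label;
}
```

For the example array and N=8 this yields {0, 0, 0, 1, 1, 1, 2}: the cut points are tied to the global running sum rather than to the sum accumulated since the last cut.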

The correct partition is supposed to be:

{0, 0, 0, 1, 1, 2, 2}
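As a reference point, a serial greedy pass reproduces this partition. This is only a sketch under two assumptions that the question does not state explicitly: that "correct" means filling each block as far as possible before opening the next, and that every element is non-negative and no larger than N. The name `greedy_labels` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Serial baseline (hypothetical helper): keep adding elements to the current
// block and open a new block as soon as the next element would exceed N.
std::vector<int> greedy_labels(const std::vector<int>& a, int N) {
    std::vector<int> label(a.size());
    int block = 0;  // index of the block currently being filled
    int sum = 0;    // sum of the elements already placed in that block
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (i > 0 && sum + a[i] > N) {  // a[i] does not fit: start a new block
            ++block;
            sum = 0;
        }
        sum += a[i];
        label[i] = block;
    }
    return label;
}
```

Each label depends on where the previous block ended, which is the sequential dependency that a parallel version has to break.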

I tried to loop through the list and re-partition at the boundary points, but my parallel implementation became slower than the serial one. Do you happen to know an efficient parallel algorithm for this problem?

Answer 1

Instead of starting with a scan, start by computing in parallel, for each node n(i), the index of the node n(j) that ends the longest subsequence starting at n(i) whose sum is less than or equal to 8. (step 1)

This is a fairly short loop executed by each thread in parallel, and it gives you a sequence of indexes.

{2, 3, 1, 4, 4, 1, 6}

{2, 3, 3, 4, 5, 6, 6} (end of step 1)
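Step 1 could be sketched as follows. This is serial C++ in which each iteration of the outer loop plays the role of one GPU thread; the function name `block_end` is hypothetical, and the sketch assumes non-negative elements that are each no larger than N.

```cpp
#include <vector>

// Step 1 (hypothetical helper): for every start index i, find the last index j
// such that a[i] + ... + a[j] <= N. The outer iterations are independent of one
// another, so on the GPU each i would be handled by its own thread.
std::vector<int> block_end(const std::vector<int>& a, int N) {
    const int n = static_cast<int>(a.size());
    std::vector<int> end(n);
    for (int i = 0; i < n; ++i) {
        int sum = a[i];
        int j = i;
        // Extend the subsequence while the next element still fits.
        while (j + 1 < n && sum + a[j + 1] <= N) {
            ++j;
            sum += a[j];
        }
        end[i] = j;
    }
    return end;
}
```

With the example input and N=8, the result is {2, 3, 3, 4, 5, 6, 6}, matching the sequence above.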

Then traverse in parallel the linked list starting at node 0. The successor of a node is the node that follows the index computed for it in step 1; basically, this is the linked list of the starting points of the bins in the solution.

[Compute in parallel, within log(n) steps and n log(n) work, a matrix that at position (i,k) gives for node i the node that is 2^k hops away from node i.

{0, 1, 2, 3, 4, 5, 6} (node i)

3, 4, 4, 5, 6, -, - (1 hop)

5, 6, 6, -, -, -, - (2 hops)

-, -, -, -, -, -, - (4 hops)
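The matrix can be built by pointer doubling, where level k+1 follows two level-k jumps. The sketch below is serial C++ with hypothetical names (`build_jump_table`, `-1` in place of "-"); it takes the step 1 result as input, and each inner loop is independent per node, so it would map to one thread per node on the GPU.

```cpp
#include <vector>

// jump[k][i] is the node 2^k hops after node i in the list of block starts,
// or -1 if the list ends before that. Hypothetical helper; `end` is the
// step 1 result (last index of the block that would start at i).
std::vector<std::vector<int>> build_jump_table(const std::vector<int>& end) {
    const int n = static_cast<int>(end.size());
    std::vector<std::vector<int>> jump;
    std::vector<int> level(n);
    for (int i = 0; i < n; ++i) {
        // One hop: the element right after the block that starts at i.
        level[i] = (end[i] + 1 < n) ? end[i] + 1 : -1;
    }
    jump.push_back(level);
    for (int hops = 2; hops < n; hops *= 2) {
        const std::vector<int> prev = jump.back();  // copy of the previous level
        std::vector<int> next(n);
        for (int i = 0; i < n; ++i) {
            // 2^(k+1) hops = 2^k hops, followed by another 2^k hops.
            next[i] = (prev[i] >= 0) ? prev[prev[i]] : -1;
        }
        jump.push_back(next);
    }
    return jump;
}
```

For the step 1 result {2, 3, 3, 4, 5, 6, 6} this produces the three rows shown above.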

Compute in parallel the position of each node in the list, starting with node 0 and successively awaking more nodes, for log(n) parallel steps overall.

0, -, -, 1, -, 2, -

Scatter to get the list of starting points.] (step 2)

{0, 3, 5}
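One way to read the "awaking" procedure is sketched below in serial C++; the names `list_positions` and `block_starts` are hypothetical, and `-1` marks nodes that are not on the list. In round k, every node that already knows its position p wakes the node 2^k hops ahead and assigns it position p + 2^k. The input is the hop matrix described above, one row per power of two.

```cpp
#include <cstddef>
#include <vector>

// Position of each node on the list that starts at node 0 (-1 = not on the list).
// Hypothetical helper; jump[k][i] is the node 2^k hops after i, or -1.
std::vector<int> list_positions(const std::vector<std::vector<int>>& jump) {
    const int n = jump.empty() ? 0 : static_cast<int>(jump[0].size());
    std::vector<int> pos(n, -1);
    if (n > 0) pos[0] = 0;  // node 0 is always the first block start
    for (std::size_t k = 0; k < jump.size(); ++k) {
        std::vector<int> next_pos = pos;  // double buffer: reads see the previous round
        for (int i = 0; i < n; ++i) {     // independent per node: one thread each
            if (pos[i] >= 0 && jump[k][i] >= 0) {
                next_pos[jump[k][i]] = pos[i] + (1 << k);
            }
        }
        pos = next_pos;
    }
    return pos;
}

// Scatter: the node with position p is written to slot p of the output.
std::vector<int> block_starts(const std::vector<int>& pos) {
    int count = 0;
    for (int p : pos) {
        if (p >= 0) ++count;
    }
    std::vector<int> starts(count);
    for (std::size_t i = 0; i < pos.size(); ++i) {
        if (pos[i] >= 0) starts[pos[i]] = static_cast<int>(i);
    }
    return starts;
}
```

Before round k the awake nodes hold positions 0 to 2^k - 1, so the nodes woken in round k receive positions 2^k to 2^(k+1) - 1 and no two writes collide.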

Now, in parallel, scatter from this sequence a new sequence of 0's and 1's as shown below: write a 1 at each starting point except the first, so that the sequence starts with a 0. (step 3)

{0, 0, 0, 1, 0, 1, 0}

Finally, apply an inclusive scan to obtain your result. (step 4)
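Steps 3 and 4 together are short enough to sketch in one piece. As before, this is serial C++ with a hypothetical function name (`labels_from_starts`); `std::partial_sum` stands in for the parallel inclusive scan.

```cpp
#include <numeric>
#include <vector>

// Steps 3 and 4 (hypothetical helper): scatter a 1 at every block start except
// the first, then take an inclusive scan so each element gets its block index.
std::vector<int> labels_from_starts(const std::vector<int>& starts, int n) {
    std::vector<int> flag(n, 0);
    for (int s : starts) {      // independent writes: one thread per block start
        if (s > 0) flag[s] = 1; // element 0 keeps flag 0, so labels begin at 0
    }
    std::vector<int> label(n);
    std::partial_sum(flag.begin(), flag.end(), label.begin());  // inclusive scan
    return label;
}
```

Calling `labels_from_starts({0, 3, 5}, 7)` gives {0, 0, 0, 1, 1, 2, 2}.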

In your example:

{2, 3, 1, 4, 4, 1, 6}

{2, 3, 3, 4, 5, 6, 6} (step 1)

{0, 3, 5} (step 2)

{0, 0, 0, 1, 0, 1, 0} (step 3)

{0, 0, 0, 1, 1, 2, 2} (step 4)

If there are no 0's in the original sequence, then the initial loop takes a constant number of steps per thread (at most 8 iterations, since every element is at least 1) and linear work overall, so what you pay for is the list traversal, which takes log(n) steps, plus a scan. If there are zeros in the original sequence but the numbers are randomly chosen natural numbers, then the probability of many long runs of zeros creating bad worst cases on the affected SMs is still fairly low. Constants much larger than 8 should not make much difference as long as the natural numbers are (uniformly) randomly chosen; for similar reasons, it should not be a problem in practice if the bound 8 is part of the input rather than a fixed constant.
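If long runs of zeros or a large bound do turn out to matter, one possible variation, which is not part of the answer above, is to replace the per-thread loop of step 1 with a binary search over prefix sums. The sketch below is serial C++ with hypothetical names and assumes non-negative elements, so that the prefix sums are non-decreasing and `std::upper_bound` applies.

```cpp
#include <algorithm>
#include <vector>

// Alternative to the step 1 loop (hypothetical helper, not from the answer):
// O(log n) per start index regardless of zeros, using exclusive prefix sums.
// Assumes every element is non-negative.
std::vector<int> block_end_bsearch(const std::vector<int>& a, int N) {
    const int n = static_cast<int>(a.size());
    std::vector<long long> prefix(n + 1, 0);  // prefix[i] = a[0] + ... + a[i-1]
    for (int i = 0; i < n; ++i) {
        prefix[i + 1] = prefix[i] + a[i];     // a parallel scan on the GPU
    }
    std::vector<int> end(n);
    for (int i = 0; i < n; ++i) {             // independent per i: one thread each
        // First prefix entry strictly greater than prefix[i] + N.
        auto it = std::upper_bound(prefix.begin() + i + 1, prefix.end(),
                                   prefix[i] + N);
        int j = static_cast<int>(it - prefix.begin()) - 2;
        end[i] = std::max(i, j);  // an element larger than N still forms its own block
    }
    return end;
}
```

On the example input this returns the same {2, 3, 3, 4, 5, 6, 6} as the loop. It costs one extra scan up front.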

If negative numbers are involved, then this solution is not feasible as is, but it might be a step toward an improved general solution.

I can see that there is room for a simpler solution; I have simply put together some known building blocks. Still, as far as I understand, the solution is correct and practically feasible, it uses known patterns, and we at least know that one exists.

Answer 1
1 2014-06-14 13:25:56
