
CUDA: an algorithm to partition an array into blocks based on the sum of elements

Here is the problem: I have an array containing random numbers, and I am supposed to partition it into blocks such that in each block the sum of all elements is no bigger than a given N. I thought I could simply solve this with a scan, but there is a nasty bug that I don't know how to fix efficiently.

So, for example, I tried the following with N=8. The array is:

{2, 3, 1, 4, 4, 1, 6}

Performing an inclusive sum scan:

{2, 5, 6, 10, 14, 15, 21}

and then a simple integer division by N=8 results in the following partition:

{0, 0, 0, 1, 1, 1, 2}

And then I realized there is a bug: the sum of all elements in the second block is 4+4+1=9 rather than at most 8, because by using the integer division I assumed that the sum of all elements in the first block is exactly 8.
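Roughly, the buggy approach looks like this with Thrust (a sketch, not my exact code); it reproduces the same wrong partition:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <vector>

int main() {
    const int N = 8;
    std::vector<int> h = {2, 3, 1, 4, 4, 1, 6};
    thrust::device_vector<int> a(h.begin(), h.end());
    thrust::device_vector<int> block_id(a.size());

    // Inclusive prefix sum: {2, 5, 6, 10, 14, 15, 21}
    thrust::inclusive_scan(a.begin(), a.end(), block_id.begin());

    // Integer division by N: {0, 0, 0, 1, 1, 1, 2} -- block 1 sums to 9,
    // because the division assumes every block is filled up to exactly N.
    using namespace thrust::placeholders;
    thrust::transform(block_id.begin(), block_id.end(), block_id.begin(), _1 / N);
    return 0;
}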

The correct partition is supposed to be:

{0, 0, 0, 1, 1, 2, 2}

I tried looping through the list and re-partitioning at the boundary points, but my parallel implementation became slower than the serial implementation. Do you happen to know an efficient parallel algorithm for this problem?

Instead of starting with a scan, start by computing in parallel, for each node n(i), the index of the node n(j) that ends the longest subsequence starting at n(i) whose sum is less than or equal to 8. (step 1)

That is a fairly short loop executed by each thread in parallel, and it gives you a sequence of indices.

{2, 3, 1, 4, 4, 1, 6}

{2, 3, 3, 4, 5, 6, 6} (end of step 1)
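A rough sketch of step 1 as a CUDA kernel (the kernel and variable names are only illustrative, and it assumes every element is at most the limit, so that each start index fits at least one element):

// For every start index i, walk forward while the running sum stays <= limit
// and record the last index that still fits; for the example this produces
// {2, 3, 3, 4, 5, 6, 6}.
__global__ void block_end_kernel(const int* a, int* end_idx, int n, int limit) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int sum = 0;
    int j = i;
    while (j < n && sum + a[j] <= limit) {
        sum += a[j];
        ++j;
    }
    end_idx[i] = j - 1;  // last index of the longest run starting at i with sum <= limit
}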

Then traverse in parallel the linked list starting at node 0, where the successor of a node is the node that follows the index computed for it in step 1; basically, this is the linked list of the starting points of the bins in the solution.

[Compute in parallel, in log(n) steps and n log(n) work, a matrix whose entry (i, k) gives, for node i, the node that is 2^k hops away from node i.

node i:      {0, 1, 2, 3, 4, 5, 6}
1 hop away:  {3, 4, 4, 5, 6, -, -}
2 hops away: {5, 6, 6, -, -, -, -}
4 hops away: {-, -, -, -, -, -, -}

Compute the position of each node in parallel, starting with node 0 and waking up successively more nodes, over log(n) parallel steps in total; the position of a node is its rank along this linked list.

{0, -, -, 1, -, 2, -}

Scatter by these positions to get the list of starting points.] (step 2)

{0, 3, 5}
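One possible sketch of step 2 (again, the names are only illustrative): take succ[i] = end_idx[i] + 1 as the successor pointer and use pointer jumping, so that in about log2(n) kernel launches every node reachable from node 0 along succ is marked; the marked nodes are exactly the starting points {0, 3, 5}.

// Mark every node already reached; a reached node wakes up the node that is
// currently jump[] hops ahead of it. Concurrent writes all store 1, so the
// race is benign.
__global__ void mark_kernel(const int* jump, int* reached, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && reached[i] && jump[i] < n)
        reached[jump[i]] = 1;
}

// Pointer doubling: after this launch, jump[i] points 2^(k+1) hops ahead
// instead of 2^k hops; nodes past the end keep the sentinel value n.
__global__ void double_kernel(const int* jump, int* jump_next, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        jump_next[i] = (jump[i] < n) ? jump[jump[i]] : n;
}

// Host side (schematically): initialize jump[i] = end_idx[i] + 1 and
// reached = {1, 0, 0, ...}, then repeat ceil(log2(n)) times:
//     mark_kernel<<<blocks, threads>>>(jump, reached, n);
//     double_kernel<<<blocks, threads>>>(jump, jump_next, n);
//     swap(jump, jump_next);
// Afterwards reached[] holds {1, 0, 0, 1, 0, 1, 0} for the example, and a
// stream compaction (e.g. thrust::copy_if over the indices with reached[i] == 1)
// yields the starting points {0, 3, 5}.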

Now, from this sequence, scatter in parallel a new sequence of 0's and 1's as below; start with a 0. (step 3)

{0, 0, 0, 1, 0, 1, 0}

Finally, apply an inclusive scan to obtain your result. (step 4)
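A minimal sketch of steps 3 and 4 with Thrust, assuming reached is a thrust::device_vector<int> holding the 0/1 flags produced by the step 2 sketch above:

// reached = {1, 0, 0, 1, 0, 1, 0}; forcing position 0 to 0 gives the step 3
// sequence {0, 0, 0, 1, 0, 1, 0}, and an inclusive scan turns it into the
// final partition {0, 0, 0, 1, 1, 2, 2}.
thrust::device_vector<int> flags = reached;
flags[0] = 0;                      // "start with a 0"
thrust::device_vector<int> block_id(flags.size());
thrust::inclusive_scan(flags.begin(), flags.end(), block_id.begin());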

In your example:

{2, 3, 1, 4, 4, 1, 6}

{2, 3, 3, 4, 5, 6, 6} (step 1)

{0, 3, 5} (step 2)

{0, 0, 0, 1, 0, 1, 0} (step 3)

{0, 0, 0, 1, 1, 2, 2} (step 4)

If there are no 0's in the original sequence, then the initial loop takes a constant number of steps (at most N per thread, since every element is at least 1) and linear work, so you only pay for the list traversal, which is log(n) steps, plus a scan. If there are zeros in the original sequence, but the numbers are randomly chosen natural numbers, then the probability of many long runs of zeros creating bad worst cases on the affected SMs is still fairly low. Constants much larger than 8 should not make much difference as long as the natural numbers are (uniformly) randomly chosen; in practice, it should also not be a problem that the bound (here 8) is part of the input, by similar reasoning.

If negative numbers are involved, then this solution as-is is not feasible, but it might be a step toward an improved, more general solution.

I can see that there is room for a simpler solution; I have just put together some known building blocks. Still, as far as I understand, the solution is correct and practically feasible, it uses known patterns, and at least we know that one exists.
