简体   繁体   中英

How can I speed-up this loop (in C)?

I'm trying to parallelize a convolution function in C. Here's the original function which convolves two arrays of 64-bit floats:

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        for (j = 0; j < in2Len; j++) {
            results[i+j] += in1[i] * in2[j];
        }
    }
}

In order to allow for concurrency (without semaphores), I created a function that computes the result for a particular position in the results array:

void convolveHelper(const Float64 *in1,
                    UInt32 in1Len,
                    const Float64 *in2,
                    UInt32 in2Len,
                    Float64 *result,
                    UInt32 outPosition)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        if (i > outPosition)
            break;
        j = outPosition - i;
        if (j >= in2Len)
            continue;
        *result += in1[i] * in2[j];
    }
}

The problem is, using convolveHelper slows down the code about 3.5 times (when running on a single thread).

Any ideas on how I can speed-up convolveHelper , while maintaining thread safety?

Convolutions in the time domain become multiplications in the Fourier domain. I suggest you grab a fast FFT library (like FFTW ) and use that. You'll go from O(n^2) to O(n log n).

Algorithmic optimizations nearly always beat micro-optimizations.

The most obvious thing that could help would be to pre-compute the starting and ending indices of the loop, and remove the extra tests on i and j (and their associated jumps). This:

for (i = 0; i < in1Len; i++) {
   if (i > outPosition)
     break;
   j = outPosition - i;
   if (j >= in2Len)
     continue;
   *result += in1[i] * in2[j];
}

could be rewritten as:

UInt32 start_i = (in2Len < outPosition) ? outPosition - in2Len + 1 : 0;
UInt32 end_i = (in1Len < outPosition) ? in1Len : outPosition + 1;

for (i = start_i; i < end_i; i++) {
   j = outPosition - i;
   *result += in1[i] * in2[j];
}

This way, the condition j >= in2Len is never true, and the loop test is essentially the combination of the tests i < in1Len and i < outPosition .

In theory you also could get rid of the assignment to j and turn i++ into ++i , but the compiler is probably doing those optimizations for you already.

  • Instead of the two if statements in the loop, you can calculate the correct minimum/maximum values for i before the loop.

  • You're calculating each result position separately. Instead, you can split the results array into blocks and have each thread calculate a block. The calculation for a block will look like the convolve function.

Unless your arrays are very big, using a thread is unlikely to actually help much, since the overhead of starting a thread will be greater than the cost of the loops. However, let's assume that your arrays are large, and threading is a net win. In that case, I'd do the following:

  • Forget your current convolveHelper , which is too complicated and won't help much.
  • Split the interior of the loop into a thread function. Ie just make

     for (j = 0; j < in2Len; j++) { results[i+j] += in1[i] * in2[j]; } 

into its own function that takes i as a parameter along with everything else.

  • Have the body of convolve simply launch threads. For maximum efficiency, use a semaphore to make sure that you never create more threads than you have cores.

Answer lies in Simple Math & NOT multi-threading (UPDATED)


Here's why...

consider a b + a c

U can optimise it as a*(b+c) (one multimplication less)

In ur case there are in2Len unnecessary multiplications in the inner-loop . Which can be eliminated.

Hence, modifying the code as follows should give us the reqd convolution:

( NOTE: The following code returns circular-convolution which must be unfolded to obtain the linear-convolution result.

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {

        for (j = 0; j < in2Len; j++) {
            results[i+j] += in2[j];
        }

        results[i] = results[i] * in1[i];

    }
}

This should give U the max performance jump more than anything else. Try it our and see!!

GoodLUCK!!

CVS @ 2600Hertz

I finally figured out how to correctly precompute the start/end indexes (a suggestion given by both Tyler McHenry and interjay ):

if (in1Len > in2Len) {
    if (outPosition < in2Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in1Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 1 + outPosition - in2Len;
        end = outPosition + 1;
    }
} else {
    if (outPosition < in1Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in2Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 0;
        end = in1Len;
    }
}

for (i = start; i < end; i++) {
    *result = in1[i] * in2[outPosition - i];
}

Unfortunately, precomputing the indexes produces no noticeable decrease in execution time :(

Let the convolve helper work on larger sets, calculating multiple results, using a short outer loop.

The key in parallelization is to find a good balance between the distribution of work between threads. Do no use more threads than the number of CPU cores.

Split the work evenly between all threads. With this kind of problem, the complexity of each threads work should be the same.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM