
How can I speed up this loop (in C)?

I'm trying to parallelize a convolution function in C. Here's the original function, which convolves two arrays of 64-bit floats:

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        for (j = 0; j < in2Len; j++) {
            results[i+j] += in1[i] * in2[j];
        }
    }
}

In order to allow for concurrency (without semaphores), I created a function that computes the result for a particular position in the results array:

void convolveHelper(const Float64 *in1,
                    UInt32 in1Len,
                    const Float64 *in2,
                    UInt32 in2Len,
                    Float64 *result,
                    UInt32 outPosition)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        if (i > outPosition)
            break;
        j = outPosition - i;
        if (j >= in2Len)
            continue;
        *result += in1[i] * in2[j];
    }
}

The problem is that using convolveHelper slows down the code by a factor of about 3.5 (when running on a single thread).

Any ideas on how I can speed up convolveHelper while maintaining thread safety?

Convolutions in the time domain become multiplications in the Fourier domain. I suggest you grab a fast FFT library (like FFTW) and use that. You'll go from O(n^2) to O(n log n).

Algorithmic optimizations nearly always beat micro-optimizations.
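
For reference, the identity behind this suggestion can be written out as below. The symbols x, h, y, L_1, L_2 and N are notation introduced here, not taken from the question: x and h stand for in1 and in2, with lengths L_1 and L_2, and h[m] is taken to be zero outside 0 <= m < L_2.

```latex
% Direct (time-domain) linear convolution: O(L_1 L_2) multiplications
y[n] = \sum_{i=0}^{L_1-1} x[i]\, h[n-i], \qquad 0 \le n \le L_1 + L_2 - 2

% Zero-pad x and h to a common length N \ge L_1 + L_2 - 1, then
y = \mathrm{IDFT}_N\!\bigl( \mathrm{DFT}_N(x) \odot \mathrm{DFT}_N(h) \bigr)

% where \odot is the element-wise product; with an FFT this costs O(N \log N)
```

The padding to N >= L_1 + L_2 - 1 matters: a DFT product on its own gives a circular convolution, and the extra zeros are what keep the wrapped-around terms from overlapping the linear result.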

The most obvious thing that could help would be to pre-compute the starting and ending indices of the loop, and remove the extra tests on i and j (and their associated jumps). This:

for (i = 0; i < in1Len; i++) {
   if (i > outPosition)
     break;
   j = outPosition - i;
   if (j >= in2Len)
     continue;
   *result += in1[i] * in2[j];
}

could be rewritten as:

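/* First i for which j = outPosition - i is still below in2Len. */
UInt32 start_i = (outPosition >= in2Len) ? outPosition - in2Len + 1 : 0;
/* One past the last valid i: i must stay below in1Len and not exceed outPosition. */
UInt32 end_i = (outPosition >= in1Len) ? in1Len : outPosition + 1;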

for (i = start_i; i < end_i; i++) {
   j = outPosition - i;
   *result += in1[i] * in2[j];
}

This way, the condition j >= in2Len is never true, and the loop test is essentially the combination of the tests i < in1Len and i <= outPosition.

In theory you could also get rid of the assignment to j and turn i++ into ++i, but the compiler is probably doing those optimizations for you already.
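
Putting the pieces of this answer together, a complete helper might look like the sketch below. It assumes Float64 and UInt32 are plain typedefs for double and uint32_t (the question does not show their definitions), and it accumulates into a local variable, which is a small addition of mine rather than part of the answer.

```c
#include <stdint.h>

/* Assumed stand-ins for the question's types (their real definitions are not shown). */
typedef double   Float64;
typedef uint32_t UInt32;

/* Adds one output sample of the linear convolution to *result.
   A valid i satisfies i < in1Len and 0 <= outPosition - i < in2Len. */
void convolveHelper(const Float64 *in1, UInt32 in1Len,
                    const Float64 *in2, UInt32 in2Len,
                    Float64 *result, UInt32 outPosition)
{
    /* Smallest i that keeps j = outPosition - i below in2Len. */
    UInt32 start = (outPosition >= in2Len) ? outPosition - in2Len + 1 : 0;
    /* One past the largest i that keeps i < in1Len and j >= 0. */
    UInt32 end = (outPosition < in1Len) ? outPosition + 1 : in1Len;
    Float64 sum = 0.0;
    UInt32 i;

    /* If outPosition lies past the end of the output, start >= end and the loop is skipped. */
    for (i = start; i < end; i++) {
        sum += in1[i] * in2[outPosition - i];
    }
    *result += sum;
}
```

Summing into a local lets the compiler keep the running total in a register, since it no longer has to assume that result might alias in1 or in2. Whether that is measurable depends on the compiler and flags.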

  • Instead of the two if statements in the loop, you can calculate the correct minimum/maximum values for i before the loop.

  • You're calculating each result position separately. Instead, you can split the results array into blocks and have each thread calculate a block. The calculation for a block will look like the convolve function.
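
A minimal sketch of such a block function, under the assumption that Float64 and UInt32 are typedefs for double and uint32_t. The name convolveBlock and the half-open range [outStart, outEnd) are hypothetical choices made for illustration. It keeps the shape of the original convolve loops but clamps i and j so that only results inside the block are written:

```c
#include <stdint.h>

/* Assumed stand-ins for the question's types. */
typedef double   Float64;
typedef uint32_t UInt32;

/* Accumulates results[outStart .. outEnd) of the linear convolution.
   Positions outside that range are never read or written, so calls on
   disjoint ranges can run concurrently without locking. */
void convolveBlock(const Float64 *in1, UInt32 in1Len,
                   const Float64 *in2, UInt32 in2Len,
                   Float64 *results, UInt32 outStart, UInt32 outEnd)
{
    /* Rows below iStart only touch positions before outStart. */
    UInt32 iStart = (outStart >= in2Len) ? outStart - in2Len + 1 : 0;
    UInt32 i, j;

    /* i < outEnd guarantees that outEnd - i below cannot underflow. */
    for (i = iStart; i < in1Len && i < outEnd; i++) {
        UInt32 jStart = (outStart > i) ? outStart - i : 0;
        UInt32 jEnd = (outEnd - i < in2Len) ? outEnd - i : in2Len;

        for (j = jStart; j < jEnd; j++) {
            results[i + j] += in1[i] * in2[j];
        }
    }
}
```

As with the original convolve, the results array is expected to be zeroed by the caller before the first call.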

Unless your arrays are very big, using a thread is unlikely to actually help much, since the overhead of starting a thread will be greater than the cost of the loops. However, let's assume that your arrays are large, and threading is a net win. In that case, I'd do the following:

  • Forget your current convolveHelper, which is too complicated and won't help much.
  • Split the interior of the loop into a thread function. I.e., just make

     for (j = 0; j < in2Len; j++) { results[i+j] += in1[i] * in2[j]; } 

into its own function that takes i as a parameter along with everything else.

  • Have the body of convolve simply launch threads. For maximum efficiency, use a semaphore to make sure that you never create more threads than you have cores.

The answer lies in simple math and NOT multi-threading (UPDATED)

Here's why...

Consider a*b + a*c.

You can optimise it as a*(b+c) (one multiplication fewer).

In your case there are in2Len unnecessary multiplications in the inner loop, which can be eliminated.

Hence, modifying the code as follows should give us the required convolution:

(NOTE: The following code returns the circular convolution, which must be unfolded to obtain the linear convolution result.)

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {

        for (j = 0; j < in2Len; j++) {
            results[i+j] += in2[j];
        }

        results[i] = results[i] * in1[i];

    }
}

This should give you a bigger performance jump than anything else. Try it out and see!!

Good luck!!

CVS @ 2600Hertz

I finally figured out how to correctly precompute the start/end indexes (a suggestion given by both Tyler McHenry and interjay):

if (in1Len > in2Len) {
    if (outPosition < in2Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in1Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 1 + outPosition - in2Len;
        end = outPosition + 1;
    }
} else {
    if (outPosition < in1Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in2Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 0;
        end = in1Len;
    }
}

for (i = start; i < end; i++) {
    *result += in1[i] * in2[outPosition - i];
}

Unfortunately, precomputing the indexes produces no noticeable decrease in execution time :(

Let the convolve helper work on larger sets, calculating multiple results, using a short outer loop.

The key to parallelization is a good balance in how the work is distributed between threads. Do not use more threads than the number of CPU cores.

Split the work evenly between all threads. With this kind of problem, the complexity of each thread's work should be the same.
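
One way this advice could be sketched with POSIX threads is shown below. Everything named here (ConvolveJob, convolveWorker, convolveParallel, MAX_THREADS) is hypothetical, the typedefs for Float64 and UInt32 are assumed, and the caller is expected to pass a numThreads no larger than the core count. Each thread owns a contiguous, equal-sized slice of the results array, so no two threads write the same element and no locking is needed.

```c
#include <pthread.h>
#include <stdint.h>

/* Assumed stand-ins for the question's types. */
typedef double   Float64;
typedef uint32_t UInt32;

#define MAX_THREADS 64  /* hypothetical upper bound for this sketch */

typedef struct {
    const Float64 *in1; UInt32 in1Len;
    const Float64 *in2; UInt32 in2Len;
    Float64 *results;
    UInt32 outStart, outEnd;  /* this thread owns results[outStart .. outEnd) */
} ConvolveJob;

/* Short outer loop over the owned output positions, inner loop with precomputed bounds. */
static void *convolveWorker(void *arg)
{
    const ConvolveJob *job = arg;
    UInt32 k, i;

    for (k = job->outStart; k < job->outEnd; k++) {
        UInt32 start = (k >= job->in2Len) ? k - job->in2Len + 1 : 0;
        UInt32 end = (k < job->in1Len) ? k + 1 : job->in1Len;
        Float64 sum = 0.0;

        for (i = start; i < end; i++) {
            sum += job->in1[i] * job->in2[k - i];
        }
        job->results[k] += sum;
    }
    return NULL;
}

/* Returns 0 on success, -1 on a bad thread count or if a thread could not be started. */
int convolveParallel(const Float64 *in1, UInt32 in1Len,
                     const Float64 *in2, UInt32 in2Len,
                     Float64 *results, UInt32 numThreads)
{
    pthread_t threads[MAX_THREADS];
    ConvolveJob jobs[MAX_THREADS];
    UInt32 outLen, chunk, t, started = 0;
    int status = 0;

    if (numThreads == 0 || numThreads > MAX_THREADS) return -1;
    if (in1Len == 0 || in2Len == 0) return 0;

    outLen = in1Len + in2Len - 1;
    chunk = (outLen + numThreads - 1) / numThreads;  /* ceiling division */

    for (t = 0; t < numThreads; t++) {
        UInt32 s = t * chunk;
        if (s >= outLen) break;  /* more threads requested than there is work */

        jobs[t].in1 = in1;  jobs[t].in1Len = in1Len;
        jobs[t].in2 = in2;  jobs[t].in2Len = in2Len;
        jobs[t].results = results;
        jobs[t].outStart = s;
        jobs[t].outEnd = (outLen - s < chunk) ? outLen : s + chunk;

        if (pthread_create(&threads[t], NULL, convolveWorker, &jobs[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (t = 0; t < started; t++) {
        pthread_join(threads[t], NULL);
    }
    return status;
}
```

Equal-sized slices are only approximately equal work: output positions near either end sum fewer products than those in the middle. When one input is much shorter than the other the difference is small. Otherwise the slice boundaries would need to be weighted by the number of terms per position.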
