Fast 2D Convolution in C

I'm trying to implement a convolutional neural network in Python. Originally, I was using scipy.signal's convolve2d function to do the convolution, but it has a lot of overhead, and it would be faster to just implement my own algorithm in C and call it from Python, since I know what my input looks like.

I've implemented two functions:

  1. Convolving a matrix with a non-separable kernel
  2. Convolving a matrix with a separable kernel (for now I've assumed Python does the rank checking and splitting before passing it on to C)

Neither of these functions has padding since I require dimensionality reduction.

Non-separable 2D Convolution

// a - 2D matrix (as a 1D array), w - kernel
double* conv2(double* a, double* w, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int l1, l2;
    register int t1, t2;

    for(i = 0; i < RESULT_DIM; i++) 
    {
        t1 = i * RESULT_DIM; // loop invariants
        for(j = 0; j < RESULT_DIM; j++) 
        {   
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                t2 = k1 * FILTER_DIM;  // loop invariants
                for(l1 = FILTER_DIM - 1, l2 = 0; l1 >= 0; l1--, l2++)
                {
                    acc += w[t2 + l1] * a[(i + k2) * IMG_DIM + (j + l2)];
                }
            }
            result[t1 + j] = acc;
        }
    }

    return result;
}

Separable 2D Convolution

// a - 2D matrix, w1, w2 - the separated 1D kernels
double* conv2sep(double* a, double* w1, double* w2, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int t;
    double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double));

    for(i = 0; i < RESULT_DIM; i++) // convolve with w1 
    {
        t = i * RESULT_DIM;
        for(j = 0; j < IMG_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w1[k1] * a[k2 * IMG_DIM + t + j];
            }
            tmp[t + j] = acc;
        }
    }

    for(i = 0; i < RESULT_DIM; i++) // convolve with w2
    {
        t = i * RESULT_DIM;
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w2[k1] * tmp[t + (j + k2)];
            }

            result[t + j] = acc;
        }
    }

    free(tmp);
    return result;
}

Compiling with gcc's -O3 flag and testing on a 2.7 GHz Intel i7, using a 4000x4000 matrix and a 5x5 kernel, I get the following timings (average of 5 runs):

Non-separable: 271.21900 ms
Separable:     127.32000 ms

This is still a considerable improvement over scipy.signal's convolve2d, which takes around 2 seconds for the same operation, but I need more speed since I'll be calling this function thousands of times. Changing the data type to float isn't an option at the moment, even though it would give a considerable speedup.

Is there a way I can optimise these algorithms further? Can I apply any cache tricks or routines to speed them up?

Any suggestions would be appreciated.

Answer

If you're running on x86 only, then consider using SSE or AVX SIMD optimisation. For double data the throughput improvement will be modest, but if you can switch to float then you may be able to get around a 4x improvement with SSE or 8x with AVX. There are a number of questions and answers about this very topic on Stack Overflow already, from which you may be able to get some ideas on the implementation. Alternatively, there are also many libraries available which include high-performance 2D convolution (filtering) routines, and these typically exploit SIMD for performance, e.g. Intel's IPP (commercial) or OpenCV (free).
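As a rough illustration of the SIMD idea (a sketch, not a drop-in replacement), the inner loops of conv2 could be vectorised with AVX intrinsics to compute four adjacent double-precision outputs per iteration. This assumes an AVX-capable CPU and build (e.g. gcc -O3 -mavx), and that RESULT_DIM is a multiple of 4; otherwise a scalar tail loop is needed.

#include <immintrin.h>

// Inside conv2's outer row loop: compute 4 adjacent outputs at once.
for (j = 0; j + 3 < RESULT_DIM; j += 4)
{
    __m256d vacc = _mm256_setzero_pd();
    for (k2 = 0; k2 < FILTER_DIM; k2++)
    {
        for (l2 = 0; l2 < FILTER_DIM; l2++)
        {
            // broadcast one (flipped) kernel tap to all 4 lanes
            __m256d vw = _mm256_set1_pd(w[(FILTER_DIM - 1 - k2) * FILTER_DIM
                                          + (FILTER_DIM - 1 - l2)]);
            // 4 neighbouring input pixels from the same image row
            __m256d va = _mm256_loadu_pd(&a[(i + k2) * IMG_DIM + (j + l2)]);
            vacc = _mm256_add_pd(vacc, _mm256_mul_pd(vw, va));
        }
    }
    _mm256_storeu_pd(&result[t1 + j], vacc);
}

With float data the same structure gives 8 lanes per register, which is where the larger speedup comes from.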

Another possibility is to exploit multiple cores - split your image into blocks and run each block in its own thread. E.g. if you have a 4-core CPU then split your image into 4 blocks. (See pthreads.)
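To make the threading idea concrete, here is a minimal sketch (not from the original answer) that splits the output rows of the non-separable convolution across a fixed number of pthreads. The names conv_job, conv2_rows, conv2_threaded and NUM_THREADS are illustrative, the IMG_DIM / FILTER_DIM / RESULT_DIM macros are assumed to be defined as in the question, and the code should be built with -pthread.

#include <pthread.h>

#define NUM_THREADS 4   /* e.g. one thread per physical core */

typedef struct {
    const double *a, *w;
    double *result;
    int row_start, row_end;   /* half-open range of output rows */
} conv_job;

/* Worker: same arithmetic as conv2, restricted to rows [row_start, row_end). */
static void* conv2_rows(void* arg)
{
    conv_job* job = (conv_job*)arg;
    for (int i = job->row_start; i < job->row_end; i++)
    {
        for (int j = 0; j < RESULT_DIM; j++)
        {
            double acc = 0.0;
            for (int k = 0; k < FILTER_DIM; k++)
                for (int l = 0; l < FILTER_DIM; l++)
                    acc += job->w[(FILTER_DIM - 1 - k) * FILTER_DIM + (FILTER_DIM - 1 - l)]
                         * job->a[(i + k) * IMG_DIM + (j + l)];
            job->result[i * RESULT_DIM + j] = acc;
        }
    }
    return NULL;
}

/* Split the output rows across the workers; each thread writes a disjoint
   slice of result, so no locking is needed. */
void conv2_threaded(const double* a, const double* w, double* result)
{
    pthread_t threads[NUM_THREADS];
    conv_job jobs[NUM_THREADS];
    int rows = RESULT_DIM / NUM_THREADS;

    for (int n = 0; n < NUM_THREADS; n++)
    {
        jobs[n] = (conv_job){ a, w, result, n * rows,
                              (n == NUM_THREADS - 1) ? RESULT_DIM : (n + 1) * rows };
        pthread_create(&threads[n], NULL, conv2_rows, &jobs[n]);
    }
    for (int n = 0; n < NUM_THREADS; n++)
        pthread_join(threads[n], NULL);
}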

You can of course combine both of the above ideas, if you really want to fully optimise this operation.


Some small optimisations which you can apply to your current code, and to any future implementations (e.g. SIMD):

  • if your kernels are symmetric (or odd-symmetric) then you can reduce the number of operations by adding (or subtracting) symmetric input values and performing one multiply rather than two - see the sketch after this list

  • for the separable case, rather than allocating a full-frame temporary buffer, consider using a "strip-mining" approach - allocate a smaller buffer which is full width but only a relatively small number of rows high, then process your image in "strips", alternately applying the horizontal kernel and the vertical kernel. The advantage of this is a much more cache-friendly access pattern and a smaller memory footprint.
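As a concrete illustration of the first point above (a sketch, not from the original answer): with a symmetric 5-tap kernel, one output sample of a horizontal pass needs only three multiplies instead of five. Here x and j are illustrative stand-ins for an input row pointer and an output column index.

/* Symmetric kernel: w[0] == w[4] and w[1] == w[3], so pairs of taps
   can share a multiply. */
acc  = w[2] *  x[j + 2];               /* centre tap          */
acc += w[1] * (x[j + 1] + x[j + 3]);   /* fold the inner pair */
acc += w[0] * (x[j]     + x[j + 4]);   /* fold the outer pair */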


A few comments on coding style:

  • the register keyword has been redundant for many years, and modern compilers will emit a warning if you try to use it - save yourself some noise (and some typing) by ditching it

  • casting the result of malloc in C is frowned upon - it's redundant and potentially dangerous.

  • make any input parameters const (i.e. read-only) and use restrict for any parameters which can never be aliased (e.g. a and result) - this can not only help to avoid programming errors (at least in the case of const), but in some cases it can help the compiler to generate better-optimised code (particularly in the case of potentially aliased pointers); see the signature sketch below.
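For example, the prototypes might become something like the following (a sketch, assuming C99 and that the input, kernel and output buffers never overlap):

double* conv2(const double* restrict a, const double* restrict w,
              double* restrict result);

double* conv2sep(const double* restrict a, const double* restrict w1,
                 const double* restrict w2, double* restrict result);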
