
I would like to improve the performance of this code using AVX

I profiled my code, and the most expensive part is the loop included below. I want to improve the performance of this loop using AVX. I have tried manually unrolling it and, while that does improve performance, the improvement is not satisfactory.

int N = 100000000;
int8_t* data = new int8_t[N];
for (int i = 0; i < N; i++) { data[i] = 1; }
std::array<float, 10> f = {1,2,3,4,5,6,7,8,9,10};
std::vector<float> output(N, 0);
int k = 0;
for (int i = k; i < N; i = i + 2) {
    for (int j = 0; j < 10; j++, k = j + 1) {
        output[i] += f[j] * data[i - k];
        output[i + 1] += f[j] * data[i - k + 1];
    }
}

Could I have some guidance on how to approach this?

I assume that data is a large input array of signed bytes, f is a small array of 10 floats, and output is the large output array of floats. Your code goes out of bounds for the first 10 iterations by i, so I will start i from 10 instead. Here is a clean version of the original code:

int s = 10;
for (int i = s; i < N; i += 2) {
    for (int j = 0; j < 10; j++) {
        output[i]   += f[j] * data[i-j-1];
        output[i+1] += f[j] * data[i-j];
    }
}

As it turns out, processing two iterations of i at a time does not change anything, so we simplify it further to:

for (int i = s; i < N; i++)
    for (int j = 0; j < 10; j++)
        output[i] += f[j] * data[i-j-1];

This version of the code (along with the declarations of the input/output data) should have been present in the question itself, so that others do not have to clean it up and simplify it first.


Now it is obvious that this code applies a one-dimensional convolution filter, which is a very common operation in signal processing. For instance, it can be computed in Python using the numpy.convolve function. The kernel is very short (length 10), so the Fast Fourier Transform won't provide any benefit over the brute-force approach. Given that the problem is well known, you can find a lot of articles on vectorizing small-kernel convolution. I will follow the article by hgomersall.

First, let's get rid of the reverse indexing. Obviously, we can reverse the kernel before running the main algorithm. After that, we compute the so-called cross-correlation instead of the convolution. In simple terms, we move the kernel array along the input array and compute the dot product between them for every possible offset.

std::reverse(f.data(), f.data() + 10);
for (int i = s; i < N; i++) {
    int b = i-10;
    float res = 0.0;
    for (int j = 0; j < 10; j++)
        res += f[j] * data[b+j];
    output[i] = res;
}

In order to vectorize it, let's compute 8 consecutive dot products at once. Recall that we can pack eight 32-bit float numbers into one 256-bit AVX register. We will vectorize the outer loop by i, which means that:

  • The loop over i advances by 8 on every iteration.
  • Every value inside the outer loop turns into an 8-element pack, such that the k-th element of the pack holds this value for the (i+k)-th iteration of the outer loop of the scalar version.

Here is the resulting code:

//reverse the kernel
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]); //every component will have same value
//note: you have to compute the last 16 values separately!
for (size_t i = s; i + 16 <= N; i += 8) {
    int b = i-10;
    __m256 res = _mm256_setzero_ps();
    for (size_t j = 0; j < 10; j++) {
        //load: data[b+j], data[b+j+1], data[b+j+2], ..., data[b+j+15]
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]);
        //convert first 8 bytes of loaded 16-byte pack into 8 floats
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes));
        //compute res = res + floats * revKernel[j] elementwise
        res = _mm256_fmadd_ps(revKernel[j], floats, res);
    }
    //store 8 values packed in res into: output[i], output[i+1], ..., output[i+7]
    _mm256_storeu_ps(&output[i], res);
}
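
The "last values" mentioned in the comment still have to be computed, since the loop only runs while a full 16-byte load is safe. Here is a minimal sketch of how that tail could be handled (my own addition, not part of the original answer): it assumes plain scalar code is acceptable for the few remaining elements and that f is still in its original, non-reversed order, as the revKernel initialization above implies. Declaring i outside the vectorized loop keeps its final value available afterwards. Note also that these intrinsics come from <immintrin.h> and require AVX2 and FMA support (e.g. -mavx2 -mfma or -march=native with GCC/Clang).

//sketch of the scalar tail (my own addition, not from the original answer)
size_t i = s;
for (; i + 16 <= (size_t)N; i += 8) {
    /* ...vectorized body from above... */
}
for (; i < (size_t)N; i++) {            //at most 15 elements remain here
    float res = 0.0f;
    for (int j = 0; j < 10; j++)
        res += f[j] * data[i - j - 1];  //f is assumed to be in its original order
    output[i] = res;
}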

For 100 million elements, this code takes about 120 ms on my machine, while the original scalar implementation took 850 ms. Beware: I have a Ryzen 1600 CPU, so results on Intel CPUs may be somewhat different.

Now, if you really want to unroll something, the inner loop over the 10 kernel elements is the perfect place. Here is how it is done:

__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]);
for (size_t i = s; i + 16 <= N; i += 8) {
    size_t b = i-10;
    __m256 res = _mm256_setzero_ps();
    #define DOIT(j) {\
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]); \
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes)); \
        res = _mm256_fmadd_ps(revKernel[j], floats, res); \
    }
    DOIT(0);
    DOIT(1);
    DOIT(2);
    DOIT(3);
    DOIT(4);
    DOIT(5);
    DOIT(6);
    DOIT(7);
    DOIT(8);
    DOIT(9);
    _mm256_storeu_ps(&output[i], res);
}

It takes 110 ms on my machine (slightly better than the first vectorized version).

A simple copy of all elements (with conversion from bytes to floats) takes 40 ms for me, which means this code is not yet memory-bound, and there is still some room for improvement.
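
By "simple copy with conversion" I mean something like the following sketch (my own reconstruction, not necessarily the exact code behind the 40 ms figure): the same load/convert sequence as above, but without any multiply-accumulate work, which gives a rough measure of the memory traffic alone.

//rough memory-bandwidth baseline (a sketch of a "copy with conversion"):
//load 16 bytes, convert the first 8 to floats, store them -- no FMA work at all
for (size_t i = 0; i + 16 <= (size_t)N; i += 8) {
    __m128i bytes = _mm_loadu_si128((__m128i*)&data[i]);
    __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes));
    _mm256_storeu_ps(&output[i], floats);
}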
