如何优化此阵列抽取/下采样程序的内存访问模式/缓存未命中？

Question

我最近被问到一段代码来“就地”对数组进行抽取/下采样。 这个“抽取”函数采用一个int数组，并在索引i/2的数组中的偶数索引i处存储一个条目。 它为数组中的所有条目执行此操作。

这会将原始数组中所有偶数索引条目移动到数组的前半部分。 然后可以将数组的其余部分初始化为0.总体结果是一个数组，它保留原始数组中的所有偶数索引条目（通过将它们移动到前半部分），并且数组的后半部分为0。显然用于在信号处理中对信号进行下采样。

代码看起来像这样：

void decimate (vector<int>& a) {
   int sz = a.size();
   for (int i =0; i < sz; i++) {
     if (i%2 == 0) {
        a[i/2] = a[i];
     }
    }
    for (int i =(sz-1)/2; i < sz; i++) a[i] = 0;
}

在建议将某些变量保留在寄存器中的基本改进之后，我找不到任何进一步优化它的方法，但不确定是否无法完成。

有没有办法可以优化循环中的内存访问模式，以获得更好的缓存性能？ 或者任何其他方法来优化将阵列压缩/下采样到上半部分的主要复制操作？ （例如，通过支持它的平台进行矢量化）

   for (int i =0; i < sz; i++) {
     if (i%2 == 0) {
        a[i/2] = a[i];
     }
    }

是否有任何循环变换（例如平铺/条带挖掘）可以为这种抽取循环带来高效的代码？

编辑：在下面的答案中提出了几种不同的方法，似乎利用memset / fill或指针算法来提高速度效率。这个问题主要集中在是否有明确定义的循环变换可以显着改善局部性或缓存未命中（例如，如果它是一个带有两个循环的循环嵌套，可能会考虑循环平铺以优化缓存未命中）

Answer 1

你有一个像这样的数组：

0 1 2 3 4 5 6 7 8 9

你想最终得到这个：

0 2 4 6 8 0 0 0 0 0

我这样做：

void decimate (vector<int>& a) {
  size_t slow = 1, fast = 2;

  // read the first half, write the first quarter
  size_t stop = (a.size()+1)/2;
  while (fast < stop) {
    a[slow++] = a[fast];
    fast += 2;
  }

  // read and clear the second half, write the second quarter
  stop = a.size();
  while (fast < stop) {
    a[slow++] = a[fast];
    a[fast++] = 0;
    a[fast++] = 0;
  }

  // clean up (only really needed when length is even)
  a[slow] = 0;
}

在我的系统上，这比原始版本快大约20％。

现在由您来测试，让我们知道您的系统是否更快！

Answer 2

这是一个使用指针算法和placement new的版本，它使用了std :: vector在内部使用连续内存布局的事实：

void down_sample(std::vector<int> & v){ 
    int * begin = &v[0];
    int * stop =  begin + v.size();
    int * position = begin + 2;
    int * half_position = begin +1;
    while( position < stop){
        *half_position = *position;
        ++half_position;
        position += 2;
    }
    size_t size = v.size()/2;
    int * a = new (half_position) int[size]();
}

在我的机器上，这个代码运行速度是你的禁用优化的3倍，比在gcc7.2上用-o3编译的版本快30％。 我测试了这个矢量大小为200万个元素。

我认为在您的版本行中：

for (int i =(sz-1)/2; i < sz; i++) a[i] = 0;

应该

for (int i =(sz-1)/2 + 1; i < sz; i++) a[i] = 0;

否则会将太多元素设置为零。

考虑到John Zwinck的问题，我用memset和std :: fill进行了一些快速测试，而不是放置new。

结果如下：

n = 20000000
compiled with -o0
orginal 0.111396 seconds
mine    0.0327938 seconds
memset  0.0303007 seconds
fill    0.0507268 seconds

compiled with -o3
orginal 0.0181994 seconds
mine    0.014135 seconds
memset  0.0141561 seconds
fill    0.0138893 seconds

n = 2000
compiled with -o0
orginal 3.0119e-05 seconds
mine    9.171e-06 seconds
memset  9.612e-06 seconds
fill    1.3868e-05 seconds

compiled with -o3
orginal 5.404e-06 seconds
mine    2.105e-06 seconds
memset  2.04e-06 seconds
fill    1.955e-06 seconds

n= 500000000 (with -o3)
mine=     0,350732
memeset = 0.349054  
fill =    0.352398

似乎memset在大型向量上稍微快一点，std ::在较小的向量上加快一点点。 但差别很小。

Answer 3

我的一个传递版本decimate() ：

void decimate (std::vector<int>& a) {
    const std::size_t sz = a.size();
    const std::size_t half = sz / 2;

    bool size_even = ((sz % 2) == 0);

    std::size_t index = 2;
    for (; index < half; index += 2) {
        a[index/2] = a[index];
    }

    for (; index < sz; ++index) {
        a[(index+1)/2] = a[index];
        a[index] = 0;
    }

    if (size_even && (half < sz)) {
        a[half] = 0;
    }
}

并测试它：

#include <vector>
#include <iostream>
#include <cstddef>

void decimate(std::vector<int> &v);

void print(std::vector<int> &a) {
    std::cout << "{";
    bool f = false;

    for(auto i:a) {
        if (f) std::cout << ", ";
        std::cout << i;
        f = true;
    }
    std::cout << "}" << std::endl;
}

void test(std::vector<int> v1, std::vector<int> v2) {
    auto v = v1;
    decimate(v1);

    bool ok = true;

    for(std::size_t i = 0; i < v1.size(); ++i) {
        ok = (ok && (v1[i] == v2[i]));
    }

    if (ok) {
        print(v);
        print(v1);
    } else {
        print(v);
        print(v1);
        print(v2);
    }
    std::cout << "--------- " << (ok?"ok":"fail") << "\n" << std::endl;
}

int main(int, char**)
{
    test({},
        {});

    test({1},
        {1});

    test({1, 2},
        {1, 0});

    test({1, 2, 3},
        {1, 3, 0});

    test({1, 2, 3, 4},
        {1, 3, 0, 0});

    test({1, 2, 3, 4, 5},
        {1, 3, 5, 0, 0});

    test({1, 2, 3, 4, 5, 6},
        {1, 3, 5, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7},
        {1, 3, 5, 7, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8},
        {1, 3, 5, 7, 0, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8, 9},
        {1, 3, 5, 7, 9, 0, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
        {1, 3, 5, 7, 9, 0, 0, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11},
        {1, 3, 5, 7, 9, 11, 0, 0, 0, 0, 0});

    return 0;
}

Answer 4

如果你之后将它设置为零，请不要去sz。

如果sz甚至是转到sz / 2，如果不是（sz-1）/ 2。

for (int i =0; i < sz_half; i++) 
        a[i] = a[2*i];

Answer 5

我比较了这里给出的所有答案。 我使用了intel编译器icc版本15.0.3。 使用优化级别O3。

Orig: Time difference [micro s] = 79506
JohnZwinck: Time difference [micro s] = 69127   
Hatatister: Time difference [micro s] = 79838
user2807083: Time difference [micro s] = 80000
Schorsch312: Time difference [micro s] = 84491

所有时间都指向长度为100000000的向量。

#include <vector>
#include <cstddef>
#include <iostream>
#include <chrono>

const int MAX = 100000000;

void setup(std::vector<int> & v){
    for (int i = 0 ; i< MAX; i++) {
        v.push_back(i);
    }
}


void checkResult(std::vector<int> & v) {
    int half_length;
    if (MAX%2==0)
        half_length = MAX/2;
    else
        half_length = MAX-1/2;

    for (int i = 0 ; i< half_length; i++) {
        if (v[i] != i*2)
            std::cout << "Error: v[i]="  << v[i] << " but should be "  <<     2*i <<  "\n";
    }

    for (int i = half_length+1; i< MAX; i++) {
        if (v[i] != 0)
            std::cout << "Error: v[i]="  << v[i] << " but should be 0 \n";
    }
}

void down_sample(){
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    int * begin = &v[0];
    int * stop =  begin + v.size();
    int * position = begin + 2;
    int * half_position = begin +1;
    while( position < stop){
        *half_position = *position;
        ++half_position;
        position += 2;
    }
    size_t size = v.size()/2;
    int * a = new (half_position) int[size]();

    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "Orig: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;
    checkResult(v);
}

void down_sample_JohnZwinck () {
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    size_t slow = 1, fast = 2;

    // read the first half, write the first quarter
    size_t stop = (v.size()+1)/2;
    while (fast < stop) {
        v[slow++] = v[fast];
        fast += 2;
    }

    // read and clear the second half, write the second quarter
    stop = v.size();
    while (fast < stop) {
        v[slow++] = v[fast];
        v[fast++] = 0;
        v[fast++] = 0;
    }

    // clean up (only really needed when length is even)
    v[slow] = 0;

    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "JohnZwinck: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;
    checkResult(v);

}

void down_sample_Schorsch312(){ 
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();
    int half_length;

    if (v.size()%2==0)
        half_length = MAX/2;
    else
        half_length = MAX-1/2;

    for (int i=0; i < half_length; i++) 
        v[i] = v[2*i];
    for (int i=half_length+1; i< MAX; i++) 
        v[i]=0;

    auto duration = std::chrono::steady_clock::now() - start_time;
std::cout << "Schorsch312: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;
}

void down_sample_Hatatister(){ 
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    int * begin = &v[0];
    int * stop =  begin + v.size();
    int * position = begin + 2;
    int * half_position = begin +1;

    while( position < stop){
        *half_position = *position;
        ++half_position;
        position += 2;
    }
    size_t size = v.size()/2;
    int * a = new (half_position) int[size]();
    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "Hatatister: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;

    checkResult(v);
}

void down_sample_user2807083 () {
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    const std::size_t sz = v.size();
    const std::size_t half = sz / 2;

    bool size_even = ((sz % 2) == 0);

    std::size_t index = 2;

    for (; index < half; index += 2) {
        v[index/2] = v[index];
    }

    for (; index < sz; ++index) {
        v[(index+1)/2] = v[index];
        v[index] = 0;
    }

    if (size_even && (half < sz)) {
        v[half] = 0;
    }
    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "user2807083: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;

    checkResult(v);

}

int main () {
    down_sample();
    down_sample_JohnZwinck ();
    down_sample_Schorsch312();
    down_sample_Hatatister();
    down_sample_user2807083();
}

如何优化此阵列抽取/下采样程序的内存访问模式/缓存未命中？

问题描述

5 个解决方案

解决方案1
4 2017-09-10 09:06:16

解决方案2
3 2017-09-10 13:02:18

解决方案3
1 2017-09-10 08:43:29

解决方案4
0 2017-09-10 09:43:04

解决方案5
0 2017-09-11 08:17:52

如何优化此阵列抽取/下采样程序的内存访问模式/缓存未命中？

问题描述

5 个解决方案

解决方案1 4 2017-09-10 09:06:16

解决方案2 3 2017-09-10 13:02:18

解决方案3 1 2017-09-10 08:43:29

解决方案4 0 2017-09-10 09:43:04

解决方案5 0 2017-09-11 08:17:52

解决方案1
4 2017-09-10 09:06:16

解决方案2
3 2017-09-10 13:02:18

解决方案3
1 2017-09-10 08:43:29

解决方案4
0 2017-09-10 09:43:04

解决方案5
0 2017-09-11 08:17:52