
Faster access to random elements in a C++ array

What is the fastest way to access random (non-sequential) elements in an array if the access pattern is known beforehand? The access is random for different needs at every step, so rearranging the elements is an expensive option. The code below represents an important sample of the whole application.

#include <iostream>
#include "chrono"
#include <cstdlib>

#define NN 1000000

struct Astr{
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};


int main ()
{
    struct Astr *key;
    key = new Astr[NN];
    int ii, *sequence;
    sequence = new int[NN]; // access pattern is stored here
    float frac ;

    // create array of structs
    // create array for random numbers between 0 to NN to access 'key'
    for(int i=0; i < NN; i++){
        key[i].x[1] = static_cast<double>(i);
        key[i].p = static_cast<long>(i);
        frac = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        sequence[i] = static_cast<int>(frac  * static_cast<float>(NN));
    }

    // part to check and improve
    // =========================================Random=======================================================
    std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
    double tmp;
    long rnk;

    for(int j=0; j < 1000; j++)
    for(int i=0; i < NN; i++){
        ii = sequence[i];
        tmp = key[ii].x[1];
        rnk = key[ii].p;
        key[ii].x[1] = tmp * 1.01;
        key[ii].p = rnk * 1.01;
    }


    std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    double time_uni = static_cast<double>(duration.count()) / 1000000;

    std::cout << "\n Random array access " << time_uni << "s \n" ;

    // ==========================================Sequential======================================================
    TstartMain = std::chrono::high_resolution_clock::now();

    for(int j=0; j < 1000; j++)
    for(int i=0; i < NN; i++){
        tmp = key[i].x[1];
        rnk = key[i].p;
        key[i].x[1] = tmp * 1.01;
        key[i].p = rnk * 1.01;
    }

    TendMain = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    time_uni = static_cast<double>(duration.count()) / 1000000;

    std::cout << " Sequential array access " << time_uni << "s \n" ;
    // ================================================================================================
    delete [] key;
    delete [] sequence;
}

As expected, sequential access is faster; the result on my machine is:

Random array access 21.3763s 
Sequential array access 8.7755s 

The main question is whether random access could be made any faster. The code improvement could be in terms of the container itself (e.g. a list/vector rather than a plain array). Could software prefetching be implemented?

In theory it is possible to help guide the prefetcher to speed up random access (well, on those CPUs that support it - e.g. _mm_prefetch for Intel/AMD). In practice, however, this is often a complete waste of time and will more often than not slow down your code.

The general theory is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two prior to using the value. There are, however, problems with this:

  • It is likely that you'll end up tuning the code for your CPU. When running that same code on other platforms, you'll probably find that different CPU cache layouts/sizes mean that your prefetch optimisations are now actually slowing the performance down.
  • The additional prefetch instructions will end up using more of your instruction cache, and most likely your uop cache as well. You may find this alone slows the code down.
  • This assumes the CPU actually pays attention to the _mm_prefetch instruction. It is only a hint, so there are no guarantees it will be respected by the CPU.

If you want to speed up random memory access, there are better methods than prefetching, imho.

  • Reduce the size of the data (i.e. use shorts/float16s in place of ints/floats, eradicate any erroneous padding in your structs, etc.). By reducing the size of the structs, you have less memory to read, so it will go quicker! (Simple compression schemes aren't a bad idea either!) See the compact-layout sketch after this list.
  • Sort your data so that instead of doing random access, you are processing the data sequentially (see the sorting sketch after this list).
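
To illustrate the first point, here is a minimal sketch of the "shrink the data" idea. The narrower field types in AstrCompact are an assumption - whether float/int32_t are precise enough depends on the real application - and the exact sizes quoted assume a typical 64-bit Linux system.

#include <cstdint>

// Original layout from the question (repeated here for comparison):
// 6 doubles, 3 ints, 4 longs -> 96 bytes per element, including padding.
struct Astr {
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};

// Hypothetical compact variant: floats and 32-bit integers instead.
// Roughly 52 bytes per element, so about half the memory traffic per random access.
struct AstrCompact {
    float x[3], v[3];
    int32_t i, j, k;
    int32_t rank, p, q, r;
};

static_assert(sizeof(AstrCompact) < sizeof(Astr), "compact layout should be smaller");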

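And a minimal sketch of the second point, assuming the per-step order of the updates does not matter (that assumption may not hold for the real application): sort the index array once, and the scattered accesses become a mostly forward sweep through memory that the hardware prefetcher handles well. Astr and sequence are the names from the question; do_it_sorted and the parameter n are hypothetical.

#include <algorithm>

void do_it_sorted(Astr *key, int *sequence, int n) {
    // One O(N log N) sort up front...
    std::sort(sequence, sequence + n);

    // ...then the update loop walks through 'key' in increasing address order,
    // which the hardware prefetcher handles well.
    for (int i = 0; i < n; ++i) {
        Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
}
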
Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access (the only exception: you are optimising code for a ~2001 Pentium 4, where prefetching was basically required).

To give an example of what @robthebloke says, the following code makes a ~15% improvement on my machine:

#include <immintrin.h>

void do_it(struct Astr *key, const int *sequence)  {
    for(int i = 0; i < NN-8; ++i) {
        // Hint the element that will be needed 8 iterations from now
        // (cast to const char* for portability across compilers).
        _mm_prefetch(reinterpret_cast<const char *>(key + sequence[i+8]), _MM_HINT_NTA);
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
    // Tail: the last 8 elements, with nothing left to prefetch.
    for(int i = NN-8; i < NN; ++i) {
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
}
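
For reference, a driver loop along the lines of the question's benchmark might look like this (a sketch, not part of the original answer; timings will vary with CPU and cache sizes):

for (int j = 0; j < 1000; ++j)
    do_it(key, sequence);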
