
faster access to random elements in c++ array

What is the fastest way to access random (non-sequential) elements in an array if the access pattern is known beforehand? The access pattern differs at every step, so rearranging the elements is an expensive option. The code below is a representative sample of the whole application.

#include <iostream>
#include <chrono>
#include <cstdlib>

#define NN 1000000

struct Astr{
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};


int main ()
{
    struct Astr *key;
    key = new Astr[NN];
    int ii, *sequence;
    sequence = new int[NN]; // access pattern is stored here
    float frac ;

    // create array of structs
    // create array for random numbers between 0 to NN to access 'key'
    for(int i=0; i < NN; i++){
        key[i].x[1] = static_cast<double>(i);
        key[i].p = static_cast<long>(i);
        frac = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        sequence[i] = static_cast<int>(frac  * static_cast<float>(NN));
    }

    // part to check and improve
    // =========================================Random=======================================================
    std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
    double tmp;
    long rnk;

    for(int j=0; j < 1000; j++)
    for(int i=0; i < NN; i++){
        ii = sequence[i];
        tmp = key[ii].x[1];
        rnk = key[ii].p;
        key[ii].x[1] = tmp * 1.01;
        key[ii].p = rnk * 1.01;
    }


    std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    double time_uni = static_cast<double>(duration.count()) / 1000000;

    std::cout << "\n Random array access " << time_uni << "s \n" ;

    // ==========================================Sequential======================================================
    TstartMain = std::chrono::high_resolution_clock::now();

    for(int j=0; j < 1000; j++)
    for(int i=0; i < NN; i++){
        tmp = key[i].x[1];
        rnk = key[i].p;
        key[i].x[1] = tmp * 1.01;
        key[i].p = rnk * 1.01;
    }

    TendMain = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    time_uni = static_cast<double>(duration.count()) / 1000000;

    std::cout << " Sequential array access " << time_uni << "s \n" ;
    // ================================================================================================
    delete [] key;
    delete [] sequence;
}

As expected, sequential access is faster; on my machine the output is:

Random array access 21.3763s 
Sequential array access 8.7755s 

The main question is whether random access can be made any faster. Could a different container help (e.g. a list or vector rather than a raw array)? Could software prefetching be implemented?

In theory it is possible to guide the prefetcher to speed up random access (on those CPUs that support it - e.g. _mm_prefetch for Intel/AMD). In practice, however, this is often a complete waste of time and will more often than not slow down your code.

The general theory is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two before using the value. There are, however, problems with this:

  • It is likely that you'll end up tuning the code for your CPU. When running that same code on other platforms, you'll probably find that different CPU cache layouts/sizes mean that your prefetch optimisations are now actually slowing the performance down.
  • The additional prefetch instructions will end up using up more of your instruction cache, and most likely your uop cache as well. You may find this alone slows the code down.
  • This assumes the CPU actually pays attention to the _mm_prefetch hint. It is only a hint, so there are no guarantees it will be respected by the CPU.

If you want to speed up random memory access, there are better methods than prefetching imho.

  • Reduce the size of the data (i.e. use shorts/float16s in place of int/float, eliminate any erroneous padding in your structs, etc.). By reducing the size of the structs, you have less memory to read, so it will go quicker! (Simple compression schemes aren't a bad idea either!)
  • Sort your data so that instead of doing random access, you are processing the data sequentially.

Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access (the only exception: you are optimising code for a ~2001 Pentium 4, where prefetching was basically required).
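The sorting suggestion above can be sketched as follows. This assumes the per-element updates are order-independent, which holds for the benchmark in the question since each visit just scales one element's fields; process_sorted is an illustrative name:

```cpp
#include <algorithm>
#include <vector>

struct Astr {            // same layout as in the question
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};

// Sort the index sequence once, so every pass over it walks memory in
// increasing address order instead of jumping around. Valid only when
// reordering the visits does not change the result.
void process_sorted(Astr *key, std::vector<int> &sequence) {
    std::sort(sequence.begin(), sequence.end());
    for (int idx : sequence) {
        key[idx].x[1] *= 1.01;
        key[idx].p   *= 1.01;   // long * double truncates, as in the original code
    }
}
```

The one-off sort cost is amortised over the 1000 repeated passes in the benchmark, and duplicate indices are simply processed back to back.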

To give an example of what @robthebloke says, the following code gives a ~15% improvement on my machine:

#include <immintrin.h>

// Astr and NN are as defined in the question.
void do_it(struct Astr *key, const int *sequence)  {
    // Main loop: issue a non-temporal prefetch for the element that will be
    // needed 8 iterations from now, then process the current element.
    for(int i = 0; i < NN-8; ++i) {
        _mm_prefetch((const char *)(key + sequence[i+8]), _MM_HINT_NTA);
        struct Astr *ki = key+sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
    // Tail: the last 8 elements have nothing left to prefetch.
    for(int i = NN-8; i < NN; ++i) {
        struct Astr *ki = key+sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
}
