Faster access to random elements in a C++ array
What is the fastest way to access random (non-sequential) elements in an array if the access pattern is known beforehand? The access pattern is different at every step, so rearranging the elements is an expensive option. The code below is a representative sample of the whole application.
#include <chrono>
#include <cstdlib>
#include <iostream>
#define NN 1000000
struct Astr {
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};
int main()
{
    struct Astr *key;
    key = new Astr[NN];          // array of structs
    int ii, *sequence;
    sequence = new int[NN];      // access pattern is stored here
    double frac;
    // fill 'key' and create random indices in [0, NN) used to access it
    for (int i = 0; i < NN; i++) {
        key[i].x[1] = static_cast<double>(i);
        key[i].p = static_cast<long>(i);
        // divide by RAND_MAX + 1.0 so that frac < 1 and the index stays below NN
        frac = static_cast<double>(rand()) / (static_cast<double>(RAND_MAX) + 1.0);
        sequence[i] = static_cast<int>(frac * NN);
    }
    // part to check and improve
    // =========================================Random=======================================================
    std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
    double tmp;
    long rnk;
    for (int j = 0; j < 1000; j++)
        for (int i = 0; i < NN; i++) {
            ii = sequence[i];
            tmp = key[ii].x[1];
            rnk = key[ii].p;
            key[ii].x[1] = tmp * 1.01;
            key[ii].p = static_cast<long>(rnk * 1.01);
        }
    std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(TendMain - TstartMain);
    double time_uni = static_cast<double>(duration.count()) / 1000000;
    std::cout << "\n Random array access " << time_uni << "s \n";
    // ==========================================Sequential======================================================
    TstartMain = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < 1000; j++)
        for (int i = 0; i < NN; i++) {
            tmp = key[i].x[1];
            rnk = key[i].p;
            key[i].x[1] = tmp * 1.01;
            key[i].p = static_cast<long>(rnk * 1.01);
        }
    TendMain = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(TendMain - TstartMain);
    time_uni = static_cast<double>(duration.count()) / 1000000;
    std::cout << " Sequential array access " << time_uni << "s \n";
    // ================================================================================================
    delete [] key;
    delete [] sequence;
}
As expected, sequential access is faster. The output on my machine is:
Random array access 21.3763s
Sequential array access 8.7755s
The main question is whether random access could be made any faster. The improvement could be in the container itself (e.g. list/vector rather than a raw array). Could software prefetching be used?
In theory it is possible to guide the prefetcher to speed up random access (on CPUs that support it, e.g. _mm_prefetch for Intel/AMD). In practice, however, this is often a complete waste of time and will, more often than not, slow down your code.
The general idea is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two before the value is used. There are, however, problems with this:
If you want to speed up random memory access, there are better methods than prefetching, IMHO.
Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access. (The only exception: you are optimising code for a ~2001 Pentium 4, where prefetching was basically required.)
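One technique often used for this kind of access pattern is to move the fields the hot loop touches into a compact array of their own. This is offered as an assumption, not necessarily one of the options meant above. On a typical 64-bit Linux ABI (also an assumption, since the layout is implementation-defined), Astr is 96 bytes and x[1] and p sit 64 bytes apart, so each random access is likely to touch two cache lines. The sketch below uses hypothetical Hot and Cold structs and a hypothetical update function; only the field names come from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "hot" record: only the two fields the benchmark loop touches.
// With 8-byte double and long this is 16 bytes, so several records share a cache line.
struct Hot {
    double x1;  // stands in for Astr::x[1]
    long   p;   // stands in for Astr::p
};

// Hypothetical "cold" record: the remaining Astr fields, kept in a parallel
// array at the same index and never touched by the hot loop.
struct Cold {
    double x0, x2, v[3];
    int i, j, k;
    long rank, q, r;
};

// Same update as the question's random-access loop, applied to the hot array.
void update(std::vector<Hot>& hot, const std::vector<int>& sequence) {
    for (int idx : sequence) {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < hot.size());
        Hot& h = hot[idx];
        h.x1 *= 1.01;
        h.p = static_cast<long>(h.p * 1.01);
    }
}
```

Whether this helps depends on how often the rest of the application needs the cold fields together with the hot ones, so it should be measured on the real workload and not assumed.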
To give an example of what @robthebloke says, the following code gives a ~15% improvement on my machine:
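#include <immintrin.h>
void do_it(struct Astr *key, const int *sequence) {
    // main loop: prefetch the element that will be needed 8 iterations from now
    for (int i = 0; i < NN-8; ++i) {
        // the cast keeps this portable: some compilers declare the parameter as const char *
        _mm_prefetch((const char *)(key + sequence[i+8]), _MM_HINT_NTA);
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
    // tail: the last 8 elements have nothing left to prefetch
    for (int i = NN-8; i < NN; ++i) {
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
}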
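The look-ahead of 8 above is a tuning choice, and the best value is likely to differ between machines. A minimal sketch of the same idea with the distance as a template parameter follows, so that several values can be timed. The names Item, prefetch_hint, update_prefetched and update_plain are hypothetical, and Item is a small stand-in for Astr. On GCC/Clang it uses __builtin_prefetch as a stand-in for _mm_prefetch with _MM_HINT_NTA, which is only roughly equivalent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
// rw = 0 (read), locality = 0 (no temporal locality).
inline void prefetch_hint(const void* p) { __builtin_prefetch(p, 0, 0); }
#else
#include <xmmintrin.h>
inline void prefetch_hint(const void* p) {
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_NTA);
}
#endif

struct Item { double x; long p; };  // hypothetical stand-in for Astr

// Reference version without prefetching.
void update_plain(std::vector<Item>& items, const std::vector<int>& seq) {
    for (std::size_t i = 0; i < seq.size(); ++i) {
        Item& it = items[seq[i]];
        it.x *= 1.01;
        it.p = static_cast<long>(it.p * 1.01);
    }
}

// Same update, but hints at the element needed Distance iterations later.
// The bounds check replaces the separate tail loop and also covers
// sequences shorter than Distance.
template <std::size_t Distance>
void update_prefetched(std::vector<Item>& items, const std::vector<int>& seq) {
    const std::size_t n = seq.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + Distance < n) prefetch_hint(&items[seq[i + Distance]]);
        assert(static_cast<std::size_t>(seq[i]) < items.size());
        Item& it = items[seq[i]];
        it.x *= 1.01;
        it.p = static_cast<long>(it.p * 1.01);
    }
}
```

A prefetch is only a hint and does not change results, so update_prefetched<8> and update_plain should produce identical arrays; any speed difference has to be measured, for example by timing update_prefetched<4>, <8> and <16> on the real data.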