Why does C++ multiplication with dynamic array work better than std::vector version
I am implementing C++ matrix multiplication with different data structures and techniques (vectors, arrays, and OpenMP), and I found a strange situation... My dynamic array version is working better:
times:
openmp mult_1: time: 5.882000 s
array mult_2: time: 1.478000 s
My compilation flags are:
/usr/bin/g++ -fopenmp -pthread -std=c++1y -O3
C++ vector version
typedef std::vector<std::vector<float>> matrix_f;

void mult_1 (const matrix_f & matrixOne, const matrix_f & matrixTwo, matrix_f & result) {
    const int matrixSize = (int)result.size();
    #pragma omp parallel for simd
    for (int rowResult = 0; rowResult < matrixSize; ++rowResult) {
        for (int colResult = 0; colResult < matrixSize; ++colResult) {
            for (int k = 0; k < matrixSize; ++k) {
                result[rowResult][colResult] += matrixOne[rowResult][k] * matrixTwo[k][colResult];
            }
        }
    }
}
Dynamic array version
void mult_2 (float * matrixOne, float * matrixTwo, float * result, int size) {
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            for (int k = 0; k < size; ++k) {
                (*(result+(size*row)+col)) += (*(matrixOne+(size*row)+k)) * (*(matrixTwo+(size*k)+col));
            }
        }
    }
}
tests:
C++ vector version
utils::ChronoTimer timer;

/* set up simple matrices */
utils::matrix::matrix_f matr1 = std::vector<std::vector<float>>(size, std::vector<float>(size));
fillRandomMatrix(matr1);

utils::matrix::matrix_f matr2 = std::vector<std::vector<float>>(size, std::vector<float>(size));
fillRandomMatrix(matr2);

utils::matrix::matrix_f result = std::vector<std::vector<float>>(size, std::vector<float>(size));

timer.init();
utils::matrix::mult_1(matr1, matr2, result);
std::printf("openmp mult_1: time: %f s\n", timer.now() / 1000);
Dynamic array version
utils::ChronoTimer timer;

float *p_matr1 = new float[size*size];
float *p_matr2 = new float[size*size];
float *p_result = new float[size*size](); // value-initialize, so += accumulates from zero

fillRandomMatrixArray(p_matr1, size);
fillRandomMatrixArray(p_matr2, size);

timer.init();
utils::matrix::mult_2(p_matr1, p_matr2, p_result, size);
std::printf("array mult_2: time: %f s\n", timer.now() / 1000);

delete [] p_matr1;
delete [] p_matr2;
delete [] p_result;
I was checking some previous posts, but I couldn't find any related to my problem: link , link2 , link3
UPDATE: I refactored the tests based on the answers, and the vector version now works slightly better:
vector mult: time: 1.194000 s
array mult_2: time: 1.202000 s
C++ vector version
void mult (const std::vector<float> & matrixOne, const std::vector<float> & matrixTwo, std::vector<float> & result, int size) {
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            for (int k = 0; k < size; ++k) {
                result[(size*row)+col] += matrixOne[(size*row)+k] * matrixTwo[(size*k)+col];
            }
        }
    }
}
Dynamic array version
void mult_2 (float * matrixOne, float * matrixTwo, float * result, int size) {
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            for (int k = 0; k < size; ++k) {
                (*(result+(size*row)+col)) += (*(matrixOne+(size*row)+k)) * (*(matrixTwo+(size*k)+col));
            }
        }
    }
}
Also, my vectorized version works better (0.803 s).
A vector of vectors is analogous to a jagged array: an array where each entry is a pointer, each pointer pointing at another array of floats. In comparison, the raw array version is one block of memory, where you do math to find the elements.
Use a single vector, not a vector of vectors, and do the math manually. Or use a vector of fixed-size std::array s. Or write a helper type that takes the (one-dimensional) vector of float and gives you a 2-dimensional view of it.
Data in a contiguous buffer is more cache and optimization friendly than data in a bunch of disconnected buffers.
template<std::size_t Dim, class T>
struct multi_dim_array_view_helper {
    std::size_t const* dims;
    T* t;
    std::size_t stride() const {
        return multi_dim_array_view_helper<Dim-1, T>{dims+1, nullptr}.stride() * *dims;
    }
    multi_dim_array_view_helper<Dim-1, T> operator[](std::size_t i) const {
        return {dims+1, t+i*stride()};
    }
};
template<class T>
struct multi_dim_array_view_helper<1, T> {
    std::size_t stride() const { return 1; }
    T* t;
    T& operator[](std::size_t i) const {
        return t[i];
    }
    multi_dim_array_view_helper( std::size_t const*, T* p ):t(p) {}
};
template<std::size_t Dims>
using dims_t = std::array<std::size_t, Dims-1>;
template<std::size_t Dims, class T>
struct multi_dim_array_view_storage {
    dims_t<Dims> storage;
};
template<std::size_t Dims, class T>
struct multi_dim_array_view:
    multi_dim_array_view_storage<Dims, T>,
    multi_dim_array_view_helper<Dims, T>
{
    multi_dim_array_view( dims_t<Dims> d, T* t ):
        multi_dim_array_view_storage<Dims, T>{std::move(d)},
        multi_dim_array_view_helper<Dims, T>{ this->storage.data(), t }
    {}
};
Now you can do this:
std::vector<float> blah = {
    01.f, 02.f, 03.f,
    11.f, 12.f, 13.f,
    21.f, 22.f, 23.f,
};

multi_dim_array_view<2, float> view = { {3}, blah.data() };
for (std::size_t i = 0; i < 3; ++i) {
    std::cout << "[";
    for (std::size_t j = 0; j < 3; ++j)
        std::cout << view[i][j] << ",";
    std::cout << "]\n";
}
No data is copied in the view class. It just provides a view of the flat array as a multi-dimensional array.
Your approaches are quite different:

In the "dynamic array" version you allocate a single chunk of memory for each matrix and map the rows of the matrices onto that one-dimensional memory range.

In the "vector" version you use vectors of vectors, which are "really" and "dynamically" two-dimensional, meaning that the storage position of each row of your matrices is unrelated to the other rows.
What you probably want to do is:

- Use vector<float>(size*size) and perform the very same mapping you're doing in the "dynamic array" example by hand, or
- Write a class that internally handles the mapping for you and provides a 2-dimensional access interface ( T& operator()(size_t, size_t) , or some kind of row_proxy operator[](size_t) where row_proxy in turn has T& operator[](size_t) )
This is just to confirm the theory about contiguous memory in practice.
After doing some analysis of the code generated with g++ (-O2), the source can be found at: https://gist.github.com/42be237af8e3e2b1ca03
The relevant code generated for the array version is:
.L3:
lea r9, [r13+0+rbx] ; <-------- KEEPS THE ADDRESS
lea r11, [r12+rbx]
xor edx, edx
.L7:
lea r8, [rsi+rdx]
movss xmm1, DWORD PTR [r9]
xor eax, eax
.L6:
movss xmm0, DWORD PTR [r11+rax*4]
add rax, 1
mulss xmm0, DWORD PTR [r8]
add r8, r10
cmp ecx, eax
addss xmm1, xmm0
movss DWORD PTR [r9], xmm1 ; <------------ ADDRESS IS USED
jg .L6
add rdx, 4
add r9, 4 ; <--- ADDRESS INCREMENTED WITH SIZE OF FLOAT
cmp rdx, rdi
jne .L7
add ebp, 1
add rbx, r10
cmp ebp, ecx
jne .L3
See how the usage of the value of r9 reflects the contiguous memory for the destination array, and r8 for one of the input arrays.
On the other hand, the vector of vectors generates code like:
.L12:
mov r9, QWORD PTR [r12+r11]
mov rdi, QWORD PTR [rbx+r11]
xor ecx, ecx
.L16:
movss xmm1, DWORD PTR [rdi+rcx]
mov rdx, r10
xor eax, eax
jmp .L15
.L13:
movaps xmm1, xmm0
.L15:
mov rsi, QWORD PTR [rdx]
movss xmm0, DWORD PTR [r9+rax]
add rax, 4
add rdx, 24
cmp r8, rax
mulss xmm0, DWORD PTR [rsi+rcx]
addss xmm0, xmm1
movss DWORD PTR [rdi+rcx], xmm0 ; <------------ HERE
jne .L13
add rcx, 4
cmp rcx, r8
jne .L16
add r11, 24
cmp r11, rbp
jne .L12
Not surprisingly, the compiler is clever enough not to generate code for all the operator[] calls, and it does a good job of inlining them. But see how it needs to track different addresses via rdi + rcx when it stores the value back to the result vector, plus the extra memory accesses for the various sub-vectors ( mov rsi, QWORD PTR [rdx] ), which all generate some overhead.