
Why does C++ multiplication with dynamic array work better than std::vector version

I am implementing C++ matrix multiplication with different data structures and techniques (vectors, arrays and OpenMP), and I found a strange situation... my dynamic array version is working better:

Times:

openmp mult_1: time: 5.882000 s

array mult_2: time: 1.478000 s

My compilation flags are:

/usr/bin/g++ -fopenmp -pthread -std=c++1y -O3

C++ vector version

typedef std::vector<std::vector<float>> matrix_f;
void mult_1 (const matrix_f &  matrixOne, const matrix_f & matrixTwo, matrix_f & result) {
    const int matrixSize = (int)result.size();
    #pragma omp parallel for simd
    for (int rowResult = 0; rowResult < matrixSize; ++rowResult) {
        for (int colResult = 0; colResult < matrixSize; ++colResult) {
            for (int k = 0; k < matrixSize; ++k) {
                result[rowResult][colResult] += matrixOne[rowResult][k] * matrixTwo[k][colResult];  
            }
        }
    }
}

Dynamic array version

void mult_2 ( float *  matrixOne, float * matrixTwo,  float * result, int size)  {
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            for (int k = 0; k < size; ++k) {
                (*(result+(size*row)+col)) += (*(matrixOne+(size*row)+k)) * (*(matrixTwo+(size*k)+col));
            }
        }
    }
}

Tests:

C++ vector version

utils::ChronoTimer timer;
/* set Up simple matrix */
utils::matrix::matrix_f matr1 = std::vector<std::vector<float>>(size,std::vector<float>(size));
fillRandomMatrix(matr1);

utils::matrix::matrix_f matr2 = std::vector<std::vector<float>>(size,std::vector<float>(size));
fillRandomMatrix(matr2);

utils::matrix::matrix_f result = std::vector<std::vector<float>>(size,std::vector<float>(size));    
timer.init();
utils::matrix::mult_1(matr1,matr2,result);
std::printf("openmp mult_1: time: %f s\n", timer.now() / 1000);

Dynamic array version

utils::ChronoTimer timer;

float *p_matr1 = new float[size*size];
float *p_matr2 = new float[size*size];
float *p_result = new float[size*size];

fillRandomMatrixArray(p_matr1,size);
fillRandomMatrixArray(p_matr2,size);

timer.init();
utils::matrix::mult_2(p_matr1,p_matr2,p_result,size);
std::printf("array mult_2: time: %f s\n", timer.now() / 1000);

delete [] p_matr1;
delete [] p_matr2;
delete [] p_result;

I checked some previous posts, but I couldn't find any related to my problem: link, link2, link3.

UPDATE: I refactored the tests based on the answers, and the vector version works slightly better:

vector mult: time: 1.194000 s

array mult_2: time: 1.202000 s

C++ vector version

void mult (const std::vector<float> &  matrixOne, const std::vector<float> & matrixTwo, std::vector<float> & result, int size) {
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            for (int k = 0; k <size; ++k) {
                result[(size*row)+col] += matrixOne[(size*row)+k] * matrixTwo[(size*k)+col];
            }
        }
    }
}

Dynamic array version

void mult_2 ( float *  matrixOne, float * matrixTwo,  float * result, int size)  {
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            for (int k = 0; k < size; ++k) {
                (*(result+(size*row)+col)) += (*(matrixOne+(size*row)+k)) * (*(matrixTwo+(size*k)+col));
            }
        }
    }
}

Also, my vectorized version performs even better (0.803 s).

A vector of vectors is analogous to a jagged array: an array where each entry is a pointer, and each pointer points at another array of floats.

In comparison, the raw array version is one block of memory, where you do math to find the elements.

Use a single vector, not a vector of vectors, and do the math manually. Or, use a vector of fixed-size std::arrays. Or write a helper type that takes the (one-dimensional) vector of float and gives you a two-dimensional view of it.
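As an illustrative sketch of the fixed-size std::array option (the alias matrix_a and this mult are my own names, and this only works when the row width is a compile-time constant):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One heap allocation; the N-float rows are laid out back to back, so the
// whole matrix is a single contiguous block -- unlike vector<vector<float>>,
// which allocates each row separately.
template <std::size_t N>
using matrix_a = std::vector<std::array<float, N>>;

template <std::size_t N>
void mult(const matrix_a<N>& a, const matrix_a<N>& b, matrix_a<N>& result) {
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col)
            for (std::size_t k = 0; k < N; ++k)
                result[row][col] += a[row][k] * b[k][col];
}
```

Indexing stays result[row][col] with no manual offset arithmetic; the trade-off is that N must be known at compile time.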

Data in a contiguous buffer is more cache and optimization friendly than data in a bunch of disconnected buffers.

template<std::size_t Dim, class T>
struct multi_dim_array_view_helper {
  std::size_t const* dims;
  T* t;
  std::size_t stride() const {
    return
      multi_dim_array_view_helper<Dim-1, T>{dims+1, nullptr}.stride()
      * *dims;
  }
  multi_dim_array_view_helper<Dim-1, T> operator[](std::size_t i)const{
    return {dims+1, t+i*stride()};
  }
};
template<class T>
struct multi_dim_array_view_helper<1, T> {
  std::size_t stride() const{ return 1; }
  T* t;
  T& operator[](std::size_t i)const{
    return t[i];
  }
  multi_dim_array_view_helper( std::size_t const*, T* p ):t(p) {}
};
template<std::size_t Dims>
using dims_t = std::array<std::size_t, Dims-1>;
template<std::size_t Dims, class T>
struct multi_dim_array_view_storage
{
  dims_t<Dims> storage;
};
template<std::size_t Dims, class T>
struct multi_dim_array_view:
  multi_dim_array_view_storage<Dims, T>,
  multi_dim_array_view_helper<Dims, T>
{
  multi_dim_array_view( dims_t<Dims> d, T* t ):
    multi_dim_array_view_storage<Dims, T>{std::move(d)},
    multi_dim_array_view_helper<Dims, T>{
      this->storage.data(), t
    }
  {}
};

Now you can do this:

std::vector<float> blah = {
   01.f, 02.f, 03.f,
   11.f, 12.f, 13.f,
   21.f, 22.f, 23.f,
};

multi_dim_array_view<2, float> view = { {3}, blah.data() };
for (std::size_t i = 0; i < 3; ++i )
{
  std::cout << "[";
  for (std::size_t j = 0; j < 3; ++j )
    std::cout << view[i][j] << ",";
  std::cout << "]\n";
}

live example

No data is copied in the view class. It just provides a multi-dimensional view of the flat array.

Your approaches are quite different:

  • In the "dynamic array" version you allocate a single chunk of memory for each matrix and map the rows of the matrices onto that one-dimensional memory range.

  • In the "vector" version you use vectors of vectors, which are "really" and "dynamically" two-dimensional, meaning that the storage position of each row of your matrices is unrelated to the other rows.

What you probably want to do is:

  • Use vector<float>(size*size) and perform the very same mapping you're doing in the "dynamic array" example by hand, or

  • Write a class that internally handles the mapping for you and provides a two-dimensional access interface (T& operator()(size_t, size_t), or some kind of row_proxy operator[](size_t) where row_proxy in turn has T& operator[](size_t))
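A minimal sketch of such a wrapper (the class name Matrix2D and its interface are my own illustration of the operator()(size_t, size_t) variant):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper: owns one contiguous vector<float> and exposes
// two-dimensional access via operator()(row, col).
class Matrix2D {
    std::size_t size_;
    std::vector<float> data_;
public:
    explicit Matrix2D(std::size_t size) : size_(size), data_(size * size) {}
    float& operator()(std::size_t row, std::size_t col) {
        return data_[row * size_ + col];  // row-major mapping
    }
    const float& operator()(std::size_t row, std::size_t col) const {
        return data_[row * size_ + col];
    }
    std::size_t size() const { return size_; }
};

// The same triple loop as mult_2, but with readable indexing.
void mult(const Matrix2D& a, const Matrix2D& b, Matrix2D& result) {
    const std::size_t n = result.size();
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col)
            for (std::size_t k = 0; k < n; ++k)
                result(row, col) += a(row, k) * b(k, col);
}
```

The index arithmetic lives in one place, so the callers keep the contiguous-memory layout of the raw-pointer version without its pointer math.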

This is just to confirm the theory about contiguous memory, in practice.

After doing some analysis on the code generated with g++ (-O2); the source can be found at: https://gist.github.com/42be237af8e3e2b1ca03

The relevant code generated for the array version is:

.L3:
    lea r9, [r13+0+rbx]                ; <-------- KEEPS THE ADDRESS
    lea r11, [r12+rbx]
    xor edx, edx
.L7:
    lea r8, [rsi+rdx]
    movss   xmm1, DWORD PTR [r9]
    xor eax, eax
.L6:
    movss   xmm0, DWORD PTR [r11+rax*4]
    add rax, 1
    mulss   xmm0, DWORD PTR [r8]
    add r8, r10
    cmp ecx, eax
    addss   xmm1, xmm0
    movss   DWORD PTR [r9], xmm1     ; <------------ ADDRESS IS USED
    jg  .L6
    add rdx, 4
    add r9, 4                        ; <--- ADDRESS INCREMENTED WITH SIZE OF FLOAT
    cmp rdx, rdi
    jne .L7
    add ebp, 1
    add rbx, r10
    cmp ebp, ecx
    jne .L3

See how the use of r9 reflects the contiguous memory for the destination array, and r8 for one of the input arrays.

On the other hand, the vector of vectors generates code like:

.L12:
    mov r9, QWORD PTR [r12+r11]
    mov rdi, QWORD PTR [rbx+r11]
    xor ecx, ecx
.L16:
    movss   xmm1, DWORD PTR [rdi+rcx]
    mov rdx, r10
    xor eax, eax
    jmp .L15
.L13:
    movaps  xmm1, xmm0
.L15:
    mov rsi, QWORD PTR [rdx]
    movss   xmm0, DWORD PTR [r9+rax]
    add rax, 4
    add rdx, 24
    cmp r8, rax
    mulss   xmm0, DWORD PTR [rsi+rcx]
    addss   xmm0, xmm1
    movss   DWORD PTR [rdi+rcx], xmm0   ; <------------ HERE
    jne .L13
    add rcx, 4
    cmp rcx, r8
    jne .L16
    add r11, 24
    cmp r11, rbp
    jne .L12

Not surprisingly, the compiler is clever enough not to generate code for all the operator[] calls and does a good job of inlining them, but see how it needs to track different addresses via rdi + rcx when it stores the value back to the result vector, plus the extra memory accesses for the various sub-vectors (mov rsi, QWORD PTR [rdx]), all of which generate some overhead.
