
Eigen code for matrix multiplication running slower than looped multiplication using std::vector

I am learning C++ as well as machine learning, so I decided to use the Eigen library for matrix multiplication. I was training a perceptron to recognise a digit from the MNIST database. For the training phase I set the number of training cycles (or epochs) to T = 100.

The 'training matrix' is a 10000 x 785 matrix. The zeroth element of each row contains the 'label' identifying the digit to which the input data (the remaining 784 elements of the row) maps.

There is also a 784 x 1 'weights' vector containing the weights for each of the 784 features. The weights vector is multiplied with each input vector (a row of the training matrix, excluding the zeroth element) and is updated on every iteration; this happens T times for each of the 10000 inputs.
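For reference, the weight update performed in the code below is the standard perceptron rule (my summary; here $\eta$ is the learning rate, $t_i$ the target label, $y_i$ the prediction, and $x_i$ the 784-element feature slice of row $i$):

w \leftarrow w - \eta\,(y_i - t_i)\,x_i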

I wrote the following program (which captures the essence of what I am doing), comparing the "vanilla" approach of multiplying the rows of a matrix with the weight vector (using std::vector and loops) against what I felt was the best I could do with Eigen. It is not really a multiplication of a matrix with a vector: I am actually slicing a row of the training matrix and multiplying that slice with the weight vector.

The training phase took 160.662 ms with the std::vector approach, while the Eigen version usually took over 10,000 ms.

I compile the program using the following command:

clang++ -Wall -Wextra -pedantic -O3 -march=native -Xpreprocessor -fopenmp permute.cc -o perm -std=c++17

I am using a "mid" 2012 MacBook Pro running macOS Catalina and having 2.5 GHz dual core i5.我正在使用运行 macOS Catalina 并具有 2.5 GHz 双核 i5 的“mid”2012 MacBook Pro。

#include <iostream>
#include <algorithm>
#include <random>
#include <cmath>    // for fabs
#include <Eigen/Dense>
#include <ctime>
#include <chrono>
using namespace Eigen;

int main() {
    Matrix<uint8_t, Dynamic, Dynamic> m = Matrix<uint8_t, Dynamic, Dynamic>::Random(10000, 785);
    Matrix<double, 784, 1> weights_m = Matrix<double, 784, 1>::Random(784, 1);
    Matrix<uint8_t, 10000, 1> y_m, t_m;

    std::minstd_rand rng;
    rng.seed(time(NULL));
    std::uniform_int_distribution<> dist(0,1); //random integers between 0 and 1
    for (int i = 0; i < y_m.rows(); i++) {
        y_m(i) = dist(rng);
        t_m(i) = dist(rng);
    }

    int T = 100;
    int err;
    double eta;
    eta = 0.25; //learning rate
    Matrix<double, 1, 1> sum_wx_m;

    auto start1 = std::chrono::steady_clock::now(); //start of Eigen Matrix loop

    for (int iter = 0; iter < T; iter++) {
        for (int i = 0; i < m.rows(); i++) {
            sum_wx_m = m.block(i, 1, 1, 784).cast<double>() * weights_m;
        
            //some code to update y_m(i) based on the value of sum_wx_m which I left out
        
            err = y_m(i) - t_m(i);
            if (fabs(err) > 0) { //update the weights_m matrix if there's a difference between target and predicted
                weights_m = weights_m - eta * err * m.block(i, 1, 1, 784).transpose().cast<double>();
            } 
        }
    }

    auto end1 = std::chrono::steady_clock::now();
    auto diff1 = end1 - start1;
    std::cout << "Eigen matrix time is "<<std::chrono::duration <double, std::milli> (diff1).count() << " ms" << std::endl;

    //checking how std::vector form performs;

    std::vector<std::vector<uint8_t>> v(10000);
    std::vector<double> weights_v(784);
    std::vector<uint8_t> y_v(10000), t_v(10000);

    for (unsigned long i = 0; i < v.size(); i++) {
        for (int j = 0; j < m.cols(); j++) {
            v[i].push_back(m(i, j));
        }
    }

    for (unsigned long i = 0; i < weights_v.size(); i++) {
        weights_v[i] = weights_m(i);
    }

    for (unsigned long i = 0; i < y_v.size(); i++) {
        y_v[i] = dist(rng);
        t_v[i] = dist(rng);
    }

    double sum_wx_v;

    auto start2 = std::chrono::steady_clock::now(); //start of vector loop

    for (int iter = 0; iter < T; iter++) {
        for(unsigned long j = 0; j < v.size(); j++) {
            sum_wx_v = 0.0;
            for (unsigned long k = 1; k < v[0].size() ; k++) {
                sum_wx_v += weights_v[k - 1] * v[j][k];
            }
        
            //some code to update y_v[i] based on the value of sum_wx_v which I left out
        
            err = y_v[j] - t_v[j];
            if (fabs(err) > 0) {//update the weights_v matrix if there's a difference between target and predicted
                for (unsigned long k = 1; k < v[0].size(); k++) {
                    weights_v[k - 1] -= eta * err * v[j][k];
                }
            }
        }
    }

    auto end2 = std::chrono::steady_clock::now();
    auto diff2 = end2 - start2;
    std::cout << "std::vector time is "<<std::chrono::duration <double, std::milli> (diff2).count() << " ms" << std::endl;
}

What changes should I make to get better running times?

Might not be the best solution, but you can try:

  • Since Eigen's default storage order is column-major, you can make your training matrix 785 x 10000, so that each training label/data pair is contiguous in memory (also change the line where sum_wx_m is computed).
  • Use the fixed-size version of the block operations, i.e. replace m.block(i, 1, 1, 784) with m.block<1, 784>(i, 1) (with the indices reversed if you have already switched your training matrix layout), or simply map the data part of your training matrix and use a .col() reference (see the sketch below and the full example after it).
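For illustration, here is a minimal, self-contained sketch of the mapping idea from the second bullet (the name m_data is mine; no data is copied). With the 785 x 10000 column-major layout, an outer stride of 785 lets m_data.col(i) view the 784 feature bytes of example i while skipping the label in row 0:

#include <Eigen/Dense>
using namespace Eigen;

int main() {
    // 785 x 10000, column-major: row 0 of each column holds the label
    Matrix<uint8_t, Dynamic, Dynamic> m = Matrix<uint8_t, Dynamic, Dynamic>::Random(785, 10000);

    // strided view of the 784 data rows of every column; the OuterStride
    // of 785 steps from one column's feature block to the next
    Map<Matrix<uint8_t, Dynamic, Dynamic>, Unaligned, OuterStride<>>
        m_data(m.data() + 1, 784, 10000, OuterStride<>(785));

    // m_data.col(i) now aliases m.block(1, i, 784, 1) without copying
}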

Here is your code modified based on these ideas:

#include <iostream>
#include <algorithm>
#include <random>
#include <cmath>    // for fabs
#include <Eigen/Dense>
#include <ctime>
#include <chrono>
using namespace Eigen;

int main() {
    Matrix<uint8_t, Dynamic, Dynamic> m = Matrix<uint8_t, Dynamic, Dynamic>::Random(785, 10000);
    // map only the 784 data rows of each column, skipping the label in row 0;
    // the OuterStride of 785 steps from one column to the next
    Map<Matrix<uint8_t, Dynamic, Dynamic>, Unaligned, OuterStride<>> m_data(m.data() + 1, 784, 10000, OuterStride<>(785));

    Matrix<double, 784, 1> weights_m = Matrix<double, 784, 1>::Random(784, 1);
    Matrix<uint8_t, 10000, 1> y_m, t_m;

    std::minstd_rand rng;
    rng.seed(time(NULL));
    std::uniform_int_distribution<> dist(0,1); //random integers between 0 and 1
    for (int i = 0; i < y_m.rows(); i++) {
        y_m(i) = dist(rng);
        t_m(i) = dist(rng);
    }

    int T = 100;
    int err;
    double eta;
    eta = 0.25; //learning rate
    Matrix<double, 1, 1> sum_wx_m;

    auto start1 = std::chrono::steady_clock::now(); //start of Eigen Matrix loop

    for (int iter = 0; iter < T; iter++) {
        for (int i = 0; i < m.cols(); i++) {
            sum_wx_m = weights_m.transpose() * m_data.col(i).cast<double>();
        
            //some code to update y_m(i) based on the value of sum_wx_m which I left out
        
            err = y_m(i) - t_m(i);
            if (fabs(err) > 0) { //update the weights_m matrix if there's a difference between target and predicted
                weights_m = weights_m - eta * err * m_data.col(i).cast<double>();
            } 
        }
    }

    auto end1 = std::chrono::steady_clock::now();
    auto diff1 = end1 - start1;
    std::cout << "Eigen matrix time is "<<std::chrono::duration <double, std::milli> (diff1).count() << " ms" << std::endl;

    //checking how std::vector form performs;

    std::vector<std::vector<uint8_t>> v(10000);
    std::vector<double> weights_v(784);
    std::vector<uint8_t> y_v(10000), t_v(10000);

    for (unsigned long i = 0; i < v.size(); i++) {
        for (int j = 0; j < m.rows(); j++) {
            v[i].push_back(m(j, i));
        }
    }

    for (unsigned long i = 0; i < weights_v.size(); i++) {
        weights_v[i] = weights_m(i);
    }

    for (unsigned long i = 0; i < y_v.size(); i++) {
        y_v[i] = dist(rng);
        t_v[i] = dist(rng);
    }

    double sum_wx_v;

    auto start2 = std::chrono::steady_clock::now(); //start of vector loop

    for (int iter = 0; iter < T; iter++) {
        for(unsigned long j = 0; j < v.size(); j++) {
            sum_wx_v = 0.0;
            for (unsigned long k = 1; k < v[0].size() ; k++) {
                sum_wx_v += weights_v[k - 1] * v[j][k];
            }
        
            //some code to update y_v[i] based on the value of sum_wx_v which I left out
        
            err = y_v[j] - t_v[j];
            if (fabs(err) > 0) {//update the weights_v matrix if there's a difference between target and predicted
                for (unsigned long k = 1; k < v[0].size(); k++) {
                    weights_v[k - 1] -= eta * err * v[j][k];
                }
            }
        }
    }

    auto end2 = std::chrono::steady_clock::now();
    auto diff2 = end2 - start2;
    std::cout << "std::vector time is "<<std::chrono::duration <double, std::milli> (diff2).count() << " ms" << std::endl;
}

I compiled this code on my Ubuntu desktop with an i7-9700K:

g++ -Wall -Wextra -O3 -std=c++17
====================================
Eigen matrix time is 110.523 ms
std::vector time is 117.826 ms


g++ -Wall -Wextra -O3 -march=native -std=c++17
=============================================
Eigen matrix time is 66.3044 ms
std::vector time is 71.2296 ms

After discussions with users J. Schultke and puhu, I have made the following changes to my code:

  1. I have changed all the m.block(i, 1, 1, 784) calls to m.block<1, 784>(i, 1); this reduces the time required for the Eigen matrix loop by a third (first suggested by J. Schultke).
  2. I have declared my m matrix as stored in RowMajor order. By default, Eigen matrices are stored in ColMajor (column-major) order, under which the entries of a column, not a row, are contiguous. With RowMajor storage each row is stored contiguously, so the m.block() calls, which refer to a slice of a row of m, fetch a whole chunk of memory at once. This reduces the "Eigen matrix" time to below the "std::vector" time (suggested by puhu). A quick check of this contiguity is sketched after this list.
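To convince myself of point 2, here is a quick pointer check (a sketch of mine, not part of the timed code): with RowMajor storage, a 1 x 784 block of a row is a single contiguous span.

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
    Matrix<uint8_t, Dynamic, Dynamic, RowMajor> m =
        Matrix<uint8_t, Dynamic, Dynamic, RowMajor>::Random(10000, 785);

    // element (i, 1) of a RowMajor matrix lives at data() + i * 785 + 1
    int i = 3;
    auto blk = m.block<1, 784>(i, 1);
    std::cout << (blk.data() == m.data() + i * 785 + 1) << "\n"; // prints 1
}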

The average runtimes now are:

cpp:Pro$ ./perm
Eigen matrix time is 134.76 ms
std::vector time is 155.574 ms

and the modified code is:

#include <iostream>
#include <algorithm>
#include <random>
#include <cmath>    // for fabs
#include <Eigen/Dense>
#include <chrono>
#include <ctime>
using namespace Eigen;
int main() {
    Matrix<uint8_t, Dynamic, Dynamic, RowMajor> m = Matrix<uint8_t, Dynamic, Dynamic, RowMajor>::Random(10000, 785);
    Matrix<double, 784, 1> weights_m = Matrix<double, 784, 1>::Random(784, 1);
    Matrix<uint8_t, 10000, 1> y_m, t_m;
    std::minstd_rand rng;
    rng.seed(time(NULL));
    std::uniform_int_distribution<> dist(0,1); //random integers between 0 and 1
    for (int i = 0; i < y_m.rows(); i++) {
        y_m(i) = dist(rng);
        t_m(i) = dist(rng);
    }

    int T = 100;
    int err;
    double eta;
    eta = 0.25; //learning rate
    Matrix<double, 1, 1> sum_wx_m;

    auto start1 = std::chrono::steady_clock::now(); //start of Eigen Matrix loop

    for (int iter = 0; iter < T; iter++) {
        for (int i = 0; i < m.rows(); i++) {
            auto b = m.block<1, 784>(i, 1).cast<double>();
            sum_wx_m = b * weights_m;
    
            //some code to update y_m(i) based on the value of sum_wx_m which I left out
    
            err = y_m(i) - t_m(i);
            if (fabs(err) > 0) { //update the weights_m matrix if there's a difference between target and predicted
                weights_m = weights_m - eta * err * b.transpose();
            } 
        }
    }

    auto end1 = std::chrono::steady_clock::now();
    auto diff1 = end1 - start1;
    std::cout << "Eigen matrix time is "<<std::chrono::duration <double, std::milli> (diff1).count() << " ms" << std::endl;

    //checking how std::vector form performs;

    std::vector<std::vector<uint8_t>> v(10000);
    std::vector<double> weights_v(784);
    std::vector<uint8_t> y_v(10000), t_v(10000);

    for (unsigned long i = 0; i < v.size(); i++) {
        for (int j = 0; j < m.cols(); j++) {
            v[i].push_back(m(i, j));
        }
    }

    for (unsigned long i = 0; i < weights_v.size(); i++) {
        weights_v[i] = weights_m(i);
    }

    for (unsigned long i = 0; i < y_v.size(); i++) {
        y_v[i] = dist(rng);
        t_v[i] = dist(rng);
    } 

    double sum_wx_v;

    auto start2 = std::chrono::steady_clock::now(); //start of vector loop

    for (int iter = 0; iter < T; iter++) {
        for(unsigned long j = 0; j < v.size(); j++) {
            sum_wx_v = 0.0;
            for (unsigned long k = 1; k < v[0].size() ; k++) {
                sum_wx_v += weights_v[k - 1] * v[j][k];
            }
    
            //some code to update y_v[i] based on the value of sum_wx_v which I left out
    
            err = y_v[j] - t_v[j];
            if (fabs(err) > 0) {//update the weights_v matrix if there's a difference between target and predicted
                for (unsigned long k = 1; k < v[0].size(); k++) {
                    weights_v[k - 1] -= eta * err * v[j][k];
                }
            }
        }
    }

    auto end2 = std::chrono::steady_clock::now();
    auto diff2 = end2 - start2;
    std::cout << "std::vector time is "<<std::chrono::duration <double, std::milli> (diff2).count() << " ms" << std::endl;
}
