
How can I sum sub-matrices in Eigen

I have a matrix defined as:

Eigen::MatrixXd DPCint = Eigen::MatrixXd::Zero(p.szZ*(p.na-1),p.szX);

// perform some computations and fill every sub-matrix of size [p.szZ, p.szX] with some values
#pragma omp parallel for
for (int i=0; i < p.na-1; i++)
{
...
DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ....;
}

// Now sum every set of p.szZ rows to get a matrix that is [p.szZ, p.szX]

In Matlab, this operation is fast and trivial. I can't simply do a += operation here if I want to parallelize the loop with OpenMP. Similarly, I can loop through and sum every set of p.szZ rows, but that loop cannot be parallelized since each thread would be outputting to the same data. Is there some efficient way to use Eigen's indexing operations to sum sub-matrices? This seems like a simple operation and I feel like I'm missing something, but I haven't been able to find a solution for some time.

Clarification

Essentially, after the above loop, I'd like to do this in a single line:

for (int i = 0; i < p.na-1; i++)
{
DPC += DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all);
}

In Matlab, I can simply reshape the matrix into a 3D matrix and sum along the third dimension. I'm not familiar with Eigen's tensor library, and I hope this operation is doable without resorting to using the tensor library. However, my priority is speed and efficiency, so I'm open to any suggestions.
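
For concreteness, a serial version of what I'm after would be roughly the following (untested sketch, assuming Eigen 3.4's reshaped() and the default column-major storage):

// View each column of DPCint as a [p.szZ, p.na-1] block and sum its rows
// into the matching column of DPC.
for (int x = 0; x < p.szX; x++)
    DPC.col(x) += DPCint.col(x).reshaped(p.szZ, p.na - 1).rowwise().sum();

The question is how to do this, or something equivalent, efficiently and in parallel.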

Performing a parallel reduction over the na-based axis is not efficient. Indeed, this dimension is already pretty small for multiple threads to be useful, and it also (nearly) forces threads to operate on temporary matrices, which is inefficient (this is memory-bound, so it does not scale well).
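
For illustration, such a reduction would look roughly like the untested sketch below: every thread needs its own szZ-by-szX accumulator, and that extra temporary traffic is exactly what makes the approach memory-bound:

#pragma omp parallel
{
    // Per-thread partial sum: this temporary is the problem.
    Eigen::MatrixXd local = Eigen::MatrixXd::Zero(p.szZ, p.szX);

    #pragma omp for nowait
    for (int i = 0; i < p.na - 1; i++)
        local += DPCint.middleRows(i * p.szZ, p.szZ);

    // Merge the per-thread partial sums.
    #pragma omp critical
    DPC += local;
}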

An alternative solution is to parallelize the szZ dimension. Each thread can work on a slice and perform a local reduction without temporary matrices. Moreover, this approach should also improve the use of CPU caches (the section of DPC computed by each thread is more likely to fit in cache, so it is not reloaded from RAM). Here is an (untested) example:

// All threads execute the following loops (all iterations, but on different data blocks)
#pragma omp parallel
for (int i = 0; i < p.na-1; i++)
{
    // "nowait" avoid a synchronization but this require a 
    // static schedule which is a good idea to use here anyway.
    #pragma omp for schedule(static) nowait
    for (int j = 0; j < p.szZ; j++)
        DPC(j, Eigen::all) += DPCint(i*p.szZ+j, Eigen::all);
}

As pointed out by @chtz, it should be better to avoid using a temporary DPCint matrix since memory throughput is a very limited resource (especially in parallel code).

EDIT: I assumed the matrices are stored in row-major order, which is not the case by default. This can be changed (see the doc), and it would in fact make both the first and the second loop cache-efficient. However, mixing storage orders is generally error-prone, and using a row-major ordering forces you to redefine basic types. The solution of @Homer512 is an alternative implementation that is certainly better suited for column-major matrices.
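
For completeness, here is an (untested) sketch of what the row-major route looks like; the alias MatrixXdR is just an illustrative name, and every function touching these matrices has to agree on the type:

// Row-major equivalents of MatrixXd, so that the row blocks written in the
// first loop and read in the second are contiguous in memory.
using MatrixXdR = Eigen::Matrix<double, Eigen::Dynamic,
                                Eigen::Dynamic, Eigen::RowMajor>;
MatrixXdR DPCint = MatrixXdR::Zero(p.szZ * (p.na - 1), p.szX);
MatrixXdR DPC    = MatrixXdR::Zero(p.szZ, p.szX);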

Here is my take.

#pragma omp parallel
{
     /*
      * We force static schedule to prevent excessive cache-line bouncing
      * because the elements per thread are not consecutive.
      * However, most (all?) OpenMP implementations use static scheduling
      * by default anyway.
      * Switching to threads initializing full columns would be
      * more effective from a memory POV.
      */
#    pragma omp for schedule(static)
     for(int i=0; i < p.na-1; i++) {
         /*
          * Note: unlike typical C++ ranges, Eigen::seq(first, last)
          * is inclusive on both ends (use seqN(first, count) for a
          * count-based variant), so the -1 from the original code is
          * correct and kept here.
          */
         DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ...;
         /*
          * Same as
          * DPCint.middleRows(i*p.szZ, p.szZ) = ...
          */
     }
     /*
      * We rely on the implicit barrier at the end of the for-construct
      * for synchronization. Then start a new loop in the same parallel
      * construct. This one can be nowait as it is the last one.
      * Again, static scheduling limits cache-line bouncing to the first
      * and last column/cache line per thread.
      * But since we wrote rows per thread above and now read
      * columns per thread, there are still a lot of cache misses
      */
#    pragma omp for schedule(static) nowait
     for(int i=0; i < p.szX; i++) {
         /*
          * Now we let a single thread reduce a column.
          * Not a row because we deal with column-major matrices
          * so this pattern is more cache-efficient
          */
         DPC.col(i) += DPCint.col(i).reshaped(
               p.szZ, p.na - 1).rowwise().sum(); 
     }
}

Reshaping is new in Eigen-3.4. However, I noticed that the resulting assembly isn't particularly effective (no vectorization).

Rowwise reductions have always been somewhat slow in Eigen. So we might do better like this, which also works in Eigen-3.3:

#    pragma omp for schedule(static) nowait
     for(int i = 0; i < p.szX; i++) {
         const auto& incol = DPCint.col(i);
         auto outcol = DPC.col(i);
         for(int j = 0; j < p.na - 1; j++)
             outcol += incol.segment(j * p.szZ, p.szZ);
     }

Alternatively, multiplying the reshaped matrix with an all-ones vector also works surprisingly well. It needs benchmarking, but especially with Eigen using OpenBLAS, it could be faster than the rowwise summation.
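
A sketch of that variant (untested; it would replace the second loop inside the same parallel region, and noalias() avoids a temporary for the product):

     // The rowwise sum becomes a matrix-vector product against an all-ones
     // vector, which Eigen can dispatch to a GEMV kernel (its own, or
     // OpenBLAS's when compiling with -DEIGEN_USE_BLAS).
     const Eigen::VectorXd ones = Eigen::VectorXd::Ones(p.na - 1);
#    pragma omp for schedule(static) nowait
     for(int i = 0; i < p.szX; i++)
         DPC.col(i).noalias() +=
               DPCint.col(i).reshaped(p.szZ, p.na - 1) * ones;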

Benchmarking

Okay, I went ahead and did some tests. First, let's set up a minimal reproducible example, because we didn't have one before.

#include <Eigen/Dense>
#include <iostream>

void reference(Eigen::Ref<Eigen::MatrixXd> DPC,
               int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
#   pragma omp parallel for
    for(Eigen::Index a = 0; a < na; ++a)
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint(a * szZ + z, x) =
                      a * 0.25 + x * 1.34 + z * 12.68;
    for(Eigen::Index a = 0; a < na; ++a)
        DPC += DPCint.middleRows(a * szZ, szZ);
}
void test(Eigen::Ref<Eigen::MatrixXd> DPC,
          int na)
{...}
int main()
{
    const int szZ = 500, szX = 192, na = 15;
    const int repetitions = 10000;
    Eigen::MatrixXd ref = Eigen::MatrixXd::Zero(szZ, szX);
    Eigen::MatrixXd opt = Eigen::MatrixXd::Zero(szZ, szX);
    reference(ref, na);
    test(opt, na);
    std::cout << (ref - opt).cwiseAbs().sum() << std::endl;
    for(int i = 0; i < repetitions; ++i)
        test(opt, na);
}

The array dimensions are as described by the OP. The DPCint initialization was chosen to be a simple scalar expression so that any optimized implementation can be tested for correctness. The number of repetitions was picked for a reasonable runtime.

Compiled and tested with g++-10 -O3 -march=native -DNDEBUG -fopenmp on an AMD Ryzen Threadripper 2990WX (32 cores, 64 threads). NUMA enabled. Using Eigen-3.4.0.

The reference gives 16.6 seconds.

Let's optimize the initialization to get this out of the way:

void reference_op1(Eigen::Ref<Eigen::MatrixXd> DPC,
                   int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel for collapse(2)
    for(Eigen::Index a = 0; a < na; ++a)
        for(Eigen::Index x = 0; x < szX; ++x)
            DPCint.col(x).segment(a * szZ, szZ) = zvals.array() + xvals[x] + avals[a];
    for(Eigen::Index a = 0; a < na; ++a)
        DPC += DPCint.middleRows(a * szZ, szZ);
}

The linspaced expressions aren't really helping, but notice the collapse(2): since na is only 15 on a 64-thread machine, we need to parallelize over two loops. 15.4 seconds.

Let's test my proposed version:

void rowwise(Eigen::Ref<Eigen::MatrixXd> DPC,
             int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
              DPC.col(x) += DPCint.col(x).reshaped(szZ, na).rowwise().sum();
    }
}

Runs at 12.5 seconds. Not a lot of speedup given that we just parallelized the second half of our algorithm.

As I suggested earlier, rowwise reductions are crap and can be avoided with matrix-vector products. Let's see if this helps here:

void rowwise_dot(Eigen::Ref<Eigen::MatrixXd> DPC,
                 int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
    const Eigen::VectorXd ones = Eigen::VectorXd::Ones(na);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            DPC.col(x).noalias() +=
                  DPCint.col(x).reshaped(szZ, na) * ones;
    }
}

Nope, still 12.5 seconds. What happens when we compile with -DEIGEN_USE_BLAS -lopenblas_openmp? Same number. It might be worth it if you cannot compile for AVX2 but the CPU supports it, since Eigen has no support for runtime CPU feature detection. Or it might help more with float than with double, because the benefit of vectorization is higher.

What if we build our own rowwise reduction in a way that vectorizes?

void rowwise_loop(Eigen::Ref<Eigen::MatrixXd> DPC,
                  int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index a = 0; a < na; ++a)
                DPC.col(x) += DPCint.col(x).segment(a * szZ, szZ);
    }
}

13.3 seconds. Note that on my laptop (Intel i7-8850H), this was significantly faster than the rowwise version. NUMA and cache-line bouncing may be a serious issue on the larger Threadripper, but I didn't investigate the perf counters.

Reordering DPCint

At this point I think it becomes apparent that the layout of DPCint and the loop ordering in its setup are a liability. Maybe there is a reason for it. But if there isn't, I propose changing it as follows:

void reordered(Eigen::Ref<Eigen::MatrixXd> DPC,
               int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const Eigen::VectorXd avals =
          Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint.col(x).segment(z * na, na) =
                      avals.array() + xvals[x] + zvals[z];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            // colwise().sum() yields a row vector, hence the transpose
            DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum().transpose();
    }
}

The idea is to reshape it in such a way that a) colwise sums are possible and b) the same thread touches the same elements in the first and second loop.

Interestingly, this seems slower at 15.3 seconds. I guess the innermost assignment is now too short.

What happens if we fold both parts of the algorithm into one loop, reducing the synchronization overhead and improving caching?

void reordered_folded(Eigen::Ref<Eigen::MatrixXd> DPC,
                        int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const Eigen::VectorXd avals =
          Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel for
    for(Eigen::Index x = 0; x < szX; ++x) {
        for(Eigen::Index z = 0; z < szZ; ++z)
            DPCint.col(x).segment(z * na, na) =
                  avals.array() + xvals[x] + zvals[z];
        DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum().transpose();
    }
}

12.3 seconds. At this point, why do we even have a shared DPCint array? Let's use a per-thread matrix.

void reordered_loctmp(Eigen::Ref<Eigen::MatrixXd> DPC,
                      int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    const Eigen::VectorXd avals =
        Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
        Eigen::MatrixXd DPCint(na, szZ);
#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x) {
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint.col(z) = avals.array() + xvals[x] + zvals[z];
            DPC.col(x) += DPCint.colwise().sum().transpose();
        }
    }
}

Heureka! 6.8 seconds. We eliminated the cache-line bouncing. We made everything cache-friendly and properly vectorized.

The only thing I can think of now is turning DPCint into an expression that is evaluated on the fly, but this very much depends on the actual expression. Since I cannot speculate on that, I'll leave it at that.
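
Just to illustrate that idea on the synthetic expression used in this benchmark (reordered_onthefly is a hypothetical name, and a real computation may well not collapse into a closed form like this):

void reordered_onthefly(Eigen::Ref<Eigen::MatrixXd> DPC,
                        int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    // Closed-form sum over a of (a * 0.25) for a = 0..na-1.
    const double asum = 0.25 * (na - 1) * na / 2.0;
#   pragma omp parallel for
    for(Eigen::Index x = 0; x < szX; ++x)
        for(Eigen::Index z = 0; z < szZ; ++z)
            // No DPCint buffer at all: the reduction over the na axis is
            // folded directly into the per-element update of DPC.
            DPC(z, x) += asum + na * (x * 1.34 + z * 12.68);
}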
