
How can I sum sub-matrices in Eigen

I have some matrix defined as:

Eigen::MatrixXd DPCint = Eigen::MatrixXd::Zero(p.szZ*(p.na-1),p.szX);

// perform some computations and fill every sub-matrix of size [p.szZ,p.szX] with some values
#pragma omp parallel for
for (int i=0; i < p.na-1; i++)
{
...
DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ....;
}

// Now sum every p.szZ rows to get a matrix that is [p.szZ,p.szX]

In Matlab, this operation is fast and trivial. I can't simply do a += operation here if I want to parallelize the loop with OpenMP. Similarly, I can loop through and sum every set of p.szZ rows, but that loop cannot be parallelized since each thread would be outputting to the same data. Is there some efficient way to use Eigen's indexing operations to sum sub-matrices? This seems like a simple operation and I feel like I'm missing something, but I haven't been able to find a solution for some time.

Clarification

Essentially, after the above loop, I'd like to do this in a single line:

for (int i = 0; i < p.na-1; i++)
{
DPC += DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all);
}

In MATLAB, I can simply reshape the matrix into a 3D array and sum along the third dimension. I'm not familiar with Eigen's tensor library, and I hope this operation is doable without resorting to it. However, my priority is speed and efficiency, so I'm open to any suggestions.

Performing a parallel reduction over the na-based axis is not efficient. Indeed, this dimension is already too small for multiple threads to be useful, and it also (nearly) forces threads to operate on temporary matrices, which is inefficient (the operation is memory-bound, so it does not scale well).

An alternative solution is to parallelize along the szZ dimension. Each thread can work on a slice and perform a local reduction without temporary matrices. Moreover, this approach should also improve the use of CPU caches (the section of DPC computed by each thread is more likely to fit in cache, so it is not reloaded from RAM). Here is an (untested) example:

// All threads execute the following loops (all iterations, but on different data blocks)
#pragma omp parallel
for (int i = 0; i < p.na-1; i++)
{
    // "nowait" avoid a synchronization but this require a 
    // static schedule which is a good idea to use here anyway.
    #pragma omp for schedule(static) nowait
    for (int j = 0; j < p.szZ; j++)
        DPC(j, Eigen::all) += DPCint(i*p.szZ+j, Eigen::all);
}

As pointed out by @chtz, it would be even better to avoid the temporary DPCint matrix altogether, since memory throughput is a very limited resource (especially in parallel code).

EDIT: I assumed the matrices are stored in row-major order, which is not the case by default. This can be changed (see the doc), and it would in fact make both the first and the second loop cache-efficient. However, mixing storage orders is generally error-prone, and using row-major ordering forces you to redefine basic types. The solution of @Homer512 is an alternative implementation certainly better suited for column-major matrices.
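
For reference, a minimal sketch of what switching to row-major storage could look like (the MatrixXdR alias is my own, not an Eigen type):

#include <Eigen/Dense>

// Hypothetical alias: dynamically-sized double matrix with row-major storage.
using MatrixXdR = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

MatrixXdR DPCint = MatrixXdR::Zero(p.szZ * (p.na - 1), p.szX);
MatrixXdR DPC    = MatrixXdR::Zero(p.szZ, p.szX);

With this layout, each row of DPCint is contiguous in memory, so the row-wise accumulation loop above streams through memory instead of striding.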

Here is my take.

#pragma omp parallel
{
     /*
      * We force static schedule to prevent excessive cache-line bouncing
      * because the elements per thread are not consecutive.
      * However, most (all?) OpenMP implementations use static scheduling
      * by default anyway.
      * Switching to threads initializing full columns would be
      * more effective from a memory POV.
      */
#    pragma omp for schedule(static)
     for(int i=0; i < p.na-1; i++) {
         /*
          * Note: unlike iterator ranges in C++, Eigen::seq(first, last)
          * is inclusive of its end point, so the -1 in the original
          * code is needed to touch exactly p.szZ rows per block.
          */
         DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ...;
         /*
          * Same as
          * DPCint.middleRows(i*p.szZ, p.szZ) = ...
          */
     }
     /*
      * We rely on the implicit barrier at the end of the for-construct
      * for synchronization. Then start a new loop in the same parallel
      * construct. This one can be nowait as it is the last one.
      * Again, static scheduling limits cache-line bouncing to the first
      * and last column/cache line per thread.
      * But since we wrote rows per thread above and now read
      * columns per thread, there are still a lot of cache misses.
      */
#    pragma omp for schedule(static) nowait
     for(int i=0; i < p.szX; i++) {
         /*
          * Now we let a single thread reduce a column.
          * Not a row because we deal with column-major matrices
          * so this pattern is more cache-efficient
          */
         DPC.col(i) += DPCint.col(i).reshaped(
               p.szZ, p.na - 1).rowwise().sum(); 
     }
}

Reshaping is new in Eigen-3.4. However, I noticed that the resulting assembly isn't particularly effective (no vectorization).

Rowwise reductions have always been somewhat slow in Eigen. So we might do better like this, which also works in Eigen-3.3:

#    pragma omp for schedule(static) nowait
     for(int i = 0; i < p.szX; i++) {
         const auto& incol = DPCint.col(i);
         auto outcol = DPC.col(i);
         for(int j = 0; j < p.na - 1; j++)
             outcol += incol.segment(j * p.szZ, p.szZ); // add the j-th szZ-long block
     }

Alternatively, multiplying the reshaped matrix with an all-ones vector also works surprisingly well. It needs benchmarking, but especially with Eigen using OpenBLAS, it could be faster than a rowwise summation.
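
A minimal sketch of what that could look like inside the same per-column loop as above (untested here; a benchmarked variant follows below):

const Eigen::VectorXd ones = Eigen::VectorXd::Ones(p.na - 1);
// inside the omp-for loop over columns i:
DPC.col(i).noalias() += DPCint.col(i).reshaped(p.szZ, p.na - 1) * ones;

The matrix-vector product sums the p.na - 1 blocks of column i, and noalias() avoids evaluating the product into a temporary before accumulating.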

Benchmarking

Okay, I went ahead and did some tests. First, let's set up a minimal reproducible example, because we didn't have one before.

#include <Eigen/Dense>
#include <iostream>

void reference(Eigen::Ref<Eigen::MatrixXd> DPC,
               int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
#   pragma omp parallel for
    for(Eigen::Index a = 0; a < na; ++a)
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint(a * szZ + z, x) =
                      a * 0.25 + x * 1.34 + z * 12.68;
    for(Eigen::Index a = 0; a < na; ++a)
        DPC += DPCint.middleRows(a * szZ, szZ);
}
void test(Eigen::Ref<Eigen::MatrixXd> DPC,
          int na)
{...}
int main()
{
    const int szZ = 500, szX = 192, na = 15;
    const int repetitions = 10000;
    Eigen::MatrixXd ref = Eigen::MatrixXd::Zero(szZ, szX);
    Eigen::MatrixXd opt = Eigen::MatrixXd::Zero(szZ, szX);
    reference(ref, na);
    test(opt, na);
    std::cout << (ref - opt).cwiseAbs().sum() << std::endl;
    for(int i = 0; i < repetitions; ++i)
        test(opt, na);
}

The array dimensions are as described by the OP. The DPCint initialization was chosen to be simple scalar code that lets us check that any optimized implementation is still correct. The number of repetitions was picked for a reasonable runtime.

Compiled and tested with g++-10 -O3 -march=native -DNDEBUG -fopenmp on an AMD Ryzen Threadripper 2990WX (32 core, 64 thread). NUMA enabled. Using Eigen-3.4.0.

The reference gives 16.6 seconds.

Let's optimize the initialization to get this out of the way:

void reference_op1(Eigen::Ref<Eigen::MatrixXd> DPC,
                   int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel for collapse(2)
    for(Eigen::Index a = 0; a < na; ++a)
        for(Eigen::Index x = 0; x < szX; ++x)
            DPCint.col(x).segment(a * szZ, szZ) = zvals.array() + xvals[x] + avals[a];
    for(Eigen::Index a = 0; a < na; ++a)
        DPC += DPCint.middleRows(a * szZ, szZ);
}

The LinSpaced isn't really helping, but notice the collapse(2). Since na is only 15 on a 64-thread machine, we need to parallelize over two loops. 15.4 seconds.

Let's test my proposed version:

void rowwise(Eigen::Ref<Eigen::MatrixXd> DPC,
             int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
              DPC.col(x) += DPCint.col(x).reshaped(szZ, na).rowwise().sum();
    }
}

Runs at 12.5 seconds. Not a lot of speedup given that we just parallelized the second half of our algorithm.

As I suggested earlier, rowwise reductions are crap and can be avoided with matrix-vector products. Let's see if this helps here:

void rowwise_dot(Eigen::Ref<Eigen::MatrixXd> DPC,
                 int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
    const Eigen::VectorXd ones = Eigen::VectorXd::Ones(na);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            DPC.col(x).noalias() +=
                  DPCint.col(x).reshaped(szZ, na) * ones;
    }
}

Nope, still 12.5 seconds. What happens when we compile with -DEIGEN_USE_BLAS -lopenblas_openmp? Same number. It might be worth it if you cannot compile for AVX2 but the CPU supports it, since Eigen has no support for runtime CPU feature detection. Or it might help more with float than with double, because the benefit of vectorization is higher there.

What if we build our own rowwise reduction in a way that vectorizes?

void rowwise_loop(Eigen::Ref<Eigen::MatrixXd> DPC,
                  int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index a = 0; a < na; ++a)
                DPC.col(x) += DPCint.col(x).segment(a * szZ, szZ);
    }
}

13.3 seconds. Note that on my laptop (Intel i7-8850H), this was significantly faster than the rowwise version. NUMA and cache-line bouncing may be a serious issue on the larger Threadripper, but I didn't investigate perf counters.

Reordering DPCint

At this point I think it becomes apparent that the layout of DPCint and the loop ordering in its setup are a liability. Maybe there is a reason for them. But if there isn't, I propose changing them as follows:

void reordered(Eigen::Ref<Eigen::MatrixXd> DPC,
               int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const Eigen::VectorXd avals =
          Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint.col(x).segment(z * na, na) =
                      avals.array() + xvals[x] + zvals[z];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            // colwise().sum() yields a 1 x szZ row vector; transpose it
            // to match the szZ x 1 destination column
            DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum().transpose();
    }
}

The idea is to reshape it in such a way that a) colwise sums are possible and b) the same thread touches the same elements in the first and second loop.

Interestingly, this seems slower at 15.3 seconds. I guess the innermost assignment is now too short.

What happens if we fold both parts of the algorithm into one loop, reducing the synchronization overhead and improving caching?

void reordered_folded(Eigen::Ref<Eigen::MatrixXd> DPC,
                        int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const Eigen::VectorXd avals =
          Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel for
    for(Eigen::Index x = 0; x < szX; ++x) {
        for(Eigen::Index z = 0; z < szZ; ++z)
            DPCint.col(x).segment(z * na, na) =
                  avals.array() + xvals[x] + zvals[z];
        // transpose the 1 x szZ row of column sums to match the destination column
        DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum().transpose();
    }
}

12.3 seconds. At this point, why do we even have a shared DPCint array? Let's use a per-thread matrix.

void reordered_loctmp(Eigen::Ref<Eigen::MatrixXd> DPC,
                      int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    const Eigen::VectorXd avals =
        Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
        Eigen::MatrixXd DPCint(na, szZ);
#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x) {
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint.col(z) = avals.array() + xvals[x] + zvals[z];
            // again transpose the row of column sums into the destination column
            DPC.col(x) += DPCint.colwise().sum().transpose();
        }
    }
}

Eureka! 6.8 seconds. We eliminated cache-line bouncing. We made everything cache-friendly and properly vectorized.

The only thing I can think of now is turning DPCint into an expression that is evaluated on the fly, but this very much depends on the actual expression. Since I cannot speculate on that, I'll leave it at that.
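
Just to illustrate the idea on the synthetic expression used in these benchmarks, an on-the-fly version could look like the sketch below (the onthefly name is my own; whether this pays off depends entirely on the real expression):

void onthefly(Eigen::Ref<Eigen::MatrixXd> DPC,
              int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    const Eigen::VectorXd avals =
          Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel for
    for(Eigen::Index x = 0; x < szX; ++x)
        for(Eigen::Index z = 0; z < szZ; ++z)
            // the sum over the na blocks collapses into one small reduction,
            // so no intermediate matrix is ever written to memory
            DPC(z, x) += (avals.array() + xvals[x] + zvals[z]).sum();
}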
