
How to improve cwiseProduct operations?

I have this function that is called several times in my code:

void Grid::computeFVarsSigma(const int DFAType,
                             const Matrix& D_sigma,
                             const Matrix& Phi,
                             const Matrix& DPhiDx,
                             const Matrix& DPhiDy,
                             const Matrix& DPhiDz,
                             Matrix& Rho,
                             Matrix& DRhoDx,
                             Matrix& DRhoDy,
                             Matrix& DRhoDz)
{
    // auto PhiD = Phi * D_sigma;
    Rho = ((Phi * D_sigma).cwiseProduct(Phi)).rowwise().sum();

    if (DFAType == 1)
    {
        DRhoDx = 2. * ((Phi * D_sigma).cwiseProduct(DPhiDx)).rowwise().sum();
        DRhoDy = 2. * ((Phi * D_sigma).cwiseProduct(DPhiDy)).rowwise().sum();
        DRhoDz = 2. * ((Phi * D_sigma).cwiseProduct(DPhiDz)).rowwise().sum();
    }
}

For the use case I used for benchmarking, the input arrays have the following dimensions:

D_sigma     42 42
Phi     402264 42
DPhiDx  402264 42
DPhiDy  402264 42
DPhiDz  402264 42

The average time when this function is called 12 times is 0.621 seconds, measured with std::chrono::high_resolution_clock. I'm running these calculations on an AMD Ryzen 5 and compiling with g++ 7.5.0. I can bump the compiler version, but for now I'm mostly interested in code optimizations.
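
For reference, a minimal timing harness along these lines (hypothetical, not my exact benchmark code; it assumes a Grid instance and the Matrix type from above, and averages over `runs` calls):

#include <chrono>

// Hypothetical timing sketch: average the wall-clock time of `runs`
// calls to Grid::computeFVarsSigma.
double timeComputeFVarsSigma(Grid& grid, int runs,
                             const Matrix& D_sigma, const Matrix& Phi,
                             const Matrix& DPhiDx, const Matrix& DPhiDy,
                             const Matrix& DPhiDz, Matrix& Rho,
                             Matrix& DRhoDx, Matrix& DRhoDy, Matrix& DRhoDz)
{
    const auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < runs; ++i)
        grid.computeFVarsSigma(1, D_sigma, Phi, DPhiDx, DPhiDy, DPhiDz,
                               Rho, DRhoDx, DRhoDy, DRhoDz);
    const auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(stop - start).count() / runs;  // seconds per call
}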

One idea that I'd like to explore is to store the cwiseProduct computations of DRhoDx, DRhoDy and DRhoDz directly in a 3xNGridPoints Matrix. However, I don't know how to do it yet.

Are there any other manipulations that I could try to improve this function?

Thanks in advance for your comments.


I would like to thank @chtz and @Homer512 for their very nice suggestions. I was very happy with the one-liner optimization proposed by @chtz; however, @Homer512's suggestions brought a drastic change in performance, as shown in the figure below (special thanks to @Homer512). I will certainly use both suggestions as a starting point to improve other parts of my code.

Note, I'm using double, and in the figure below "return param" and "return tuple" stand for the same function returning the output as parameters and as a tuple, respectively.

[Figure: results after applying chtz's and Homer512's suggestions]

Let M=402264, N=42. In your case the Phi * D_sigma product takes M*N² FMA operations, while the cwiseProduct with the sum takes only M*N FMA operations, so the matrix product dominates by roughly a factor of N. You can save significant work if you compute Phi * D_sigma only once, but you need to actually evaluate the result, e.g.

Matrix PhiD = Phi * D_sigma;  // DO NOT USE `auto` HERE!
Rho = PhiD.cwiseProduct(Phi).rowwise().sum();
if (DFAType == 1) {
    DRhoDx = 2. * PhiD.cwiseProduct(DPhiDx).rowwise().sum();
    // ... likewise for DRhoDy and DRhoDz
}

I'll do this optimization in steps. First, we establish a baseline.

You didn't give a type definition for your Matrix type, so I define it as Eigen::MatrixXf. Also, just for my own sanity, I redefine the various Rho outputs as vectors. Note that Eigen occasionally has optimized code paths for vectors compared to matrices that just happen to be vectors, so doing this is a good idea anyway, plus it makes the code easier to read.

using Matrix = Eigen::MatrixXf;
using Vector = Eigen::VectorXf;

namespace {
void compute(const Matrix& Phi, const Matrix& D_sigma, const Matrix& DPhi,
             float factor, Vector& Rho)
{
    Rho = (Phi * D_sigma).cwiseProduct(DPhi).rowwise().sum() * factor;
}
} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
        const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
        const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
        Vector& DRhoDz)
{
    compute(Phi, D_sigma, Phi, 1.f, Rho);
    if (DFAType == 1) {
        compute(Phi, D_sigma, DPhiDx, 2.f, DRhoDx);
        compute(Phi, D_sigma, DPhiDy, 2.f, DRhoDy);
        compute(Phi, D_sigma, DPhiDz, 2.f, DRhoDz);
    }
}

The first optimization, as proposed by @chtz, is to cache the matrix multiplication. Don't use auto for this, as noted in Eigen's documentation.

namespace {
void compute(const Matrix& PhiD, const Matrix& DPhi, float factor, Vector& Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).rowwise().sum() * factor;
}
} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
        const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
        const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
        Vector& DRhoDz)
{
    const Matrix PhiD = Phi * D_sigma;
    compute(PhiD, Phi, 1.f, Rho);
    if (DFAType == 1) {
        compute(PhiD, DPhiDx, 2.f, DRhoDx);
        compute(PhiD, DPhiDy, 2.f, DRhoDy);
        compute(PhiD, DPhiDz, 2.f, DRhoDz);
    }
}

This is now 3.15 times as fast on my system.

The second step is to reduce the amount of memory required by doing the operation blockwise. The idea is pretty simple: We are somewhat constrained by memory bandwidth, especially since the matrix-matrix product is rather "thin". Plus it helps with the step after this.

Here I pick a block size of 384 rows. My rule of thumb is that the inputs and outputs should fit into the L2 cache (128-256 kiB, possibly shared by 2 threads) and that the block size should be a multiple of 16 for good vectorization across the board. 384 rows * 42 columns * 4 bytes per float ≈ 64 kiB. Adjust as required for other scalar types, but from my tests it is actually not very sensitive.

Take care to use Eigen::Ref or appropriate templates to avoid copies, as I did here in the compute helper function.

namespace {
void compute(const Matrix& PhiD, const Eigen::Ref<const Matrix>& DPhi,
             float factor, Eigen::Ref<Vector> Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).rowwise().sum() * factor;
}
} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
        const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
        const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
        Vector& DRhoDz)
{
    const Eigen::Index n = Phi.rows(), blocksize = 384;
    Rho.resize(n);
    if(DFAType == 1)
        for(Vector* vec: {&DRhoDx, &DRhoDy, &DRhoDz})
            vec->resize(n);
    Matrix PhiD;
    for(Eigen::Index i = 0; i < n; i += blocksize) {
        const Eigen::Index cur = std::min(blocksize, n - i);
        PhiD.noalias() = Phi.middleRows(i, cur) * D_sigma;
        compute(PhiD, Phi.middleRows(i, cur), 1.f, Rho.segment(i, cur));
        if (DFAType == 1) {
            compute(PhiD, DPhiDx.middleRows(i, cur), 2.f,
                    DRhoDx.segment(i, cur));
            compute(PhiD, DPhiDy.middleRows(i, cur), 2.f,
                    DRhoDy.segment(i, cur));
            compute(PhiD, DPhiDz.middleRows(i, cur), 2.f,
                    DRhoDz.segment(i, cur));
        }
    }
}

This is another speedup by a factor of 1.75.

Now that we have this, we can parallelize very easily. Eigen can parallelize the matrix-matrix multiplication internally but not the rest, so we do it all externally. The blockwise version works better here because it can keep all threads busy all the time and it makes better use of the combined L2 cache capacity of the system. Compile with -fopenmp.

namespace {
void compute(const Matrix& PhiD, const Eigen::Ref<const Matrix>& DPhi,
             float factor, Eigen::Ref<Vector> Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).rowwise().sum() * factor;
}
} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
        const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
        const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
        Vector& DRhoDz)
{
    const Eigen::Index n = Phi.rows(), blocksize = 384;
    Rho.resize(n);
    if(DFAType == 1)
        for(Vector* vec: {&DRhoDx, &DRhoDy, &DRhoDz})
            vec->resize(n);
#   pragma omp parallel
    {
        Matrix PhiD;
#       pragma omp for nowait
        for(Eigen::Index i = 0; i < n; i += blocksize) {
            const Eigen::Index cur = std::min(blocksize, n - i);
            PhiD.noalias() = Phi.middleRows(i, cur) * D_sigma;
            compute(PhiD, Phi.middleRows(i, cur), 1.f, Rho.segment(i, cur));
            if (DFAType == 1) {
                compute(PhiD, DPhiDx.middleRows(i, cur), 2.f,
                        DRhoDx.segment(i, cur));
                compute(PhiD, DPhiDy.middleRows(i, cur), 2.f,
                        DRhoDy.segment(i, cur));
                compute(PhiD, DPhiDz.middleRows(i, cur), 2.f,
                        DRhoDz.segment(i, cur));
            }
        }
    }
}

Interestingly, this doesn't produce a huge benefit on my system, only a factor of 1.25 with 8 cores / 16 threads. I have not investigated what the actual bottleneck is; I guess it's my main memory bandwidth. A system with lower per-core bandwidth and/or higher per-node bandwidth (Xeons, Threadrippers) may benefit more.

One last proposal, but this one is situational: transpose the Phi and DPhiDx/y/z matrices. This allows two further optimizations for column-major matrices such as those used by Eigen:

  1. General matrix-matrix multiplications are fastest when they are written in the pattern A.transpose() * B. Transposing the elements in Phi allows us to write PhiD = D_sigma.transpose() * Phi.

  2. Column-wise reductions are faster than row-wise ones, except for a very small number of columns such as in MatrixX4f.

namespace {
void compute(const Matrix& PhiD, const Eigen::Ref<const Matrix>& DPhi,
             float factor, Eigen::Ref<Vector> Rho)
{
    Rho = PhiD.cwiseProduct(DPhi).colwise().sum() * factor;
}
} /* namespace anonymous */

void computeFVarsSigma(const int DFAType, const Matrix& D_sigma,
        const Matrix& Phi, const Matrix& DPhiDx, const Matrix& DPhiDy,
        const Matrix& DPhiDz, Vector& Rho, Vector& DRhoDx, Vector& DRhoDy,
        Vector& DRhoDz)
{
    const Eigen::Index n = Phi.cols(), blocksize = 384;
    Rho.resize(n);
    if(DFAType == 1)
        for(Vector* vec: {&DRhoDx, &DRhoDy, &DRhoDz})
            vec->resize(n);
#   pragma omp parallel
    {
        Matrix PhiD;
#       pragma omp for nowait
        for(Eigen::Index i = 0; i < n; i += blocksize) {
            const Eigen::Index cur = std::min(blocksize, n - i);
            PhiD.noalias() = D_sigma.transpose() * Phi.middleCols(i, cur);
            compute(PhiD, Phi.middleCols(i, cur), 1.f, Rho.segment(i, cur));
            if (DFAType == 1) {
                compute(PhiD, DPhiDx.middleCols(i, cur), 2.f,
                        DRhoDx.segment(i, cur));
                compute(PhiD, DPhiDy.middleCols(i, cur), 2.f,
                        DRhoDy.segment(i, cur));
                compute(PhiD, DPhiDz.middleCols(i, cur), 2.f,
                        DRhoDz.segment(i, cur));
            }
        }
    }
}

This brings another speedup by a factor of 1.14. I would expect a greater advantage if the inner dimension grows from 42 to something closer to 100 or 1000, and also if the bottleneck above were not so pronounced.

Improvement through decomposition

There is a neat trick you can apply for the (Phi * D_sigma).cwiseProduct(Phi).rowwise().sum() case:

Let p be a row vector of Phi, S be D_sigma and d be the scalar result for this one row. Then what we compute is

d = p * S * p'

If S is positive semidefinite, we can use an LDLT decomposition:

S = P' * L * D * L' * P

into the permutation matrix P, a lower triangular matrix L and a diagonal matrix D.

From this follows:

d = p * P' * L * D * L' * P * p'
d = (p * P') * (L * sqrt(D)) * (sqrt(D) * L') * (P * p')
d = ||(p * P') * (L * sqrt(D))||^2

Computing (p * P') is a simple permutation. (L * sqrt(D)) is another fast and simple operation since D is just a diagonal matrix. The final multiplication of the (p * P') vector with the (L * sqrt(D)) matrix is also cheaper than before because L is a triangular matrix, so you can use Eigen's triangularView<Eigen::Lower> to save operations.

Since the decomposition may fail, you have to provide the original approach as a fall-back.
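
For illustration, here is a minimal sketch of how that could look with Eigen's LDLT class. This is a hypothetical example, not code from the answer above: the function name rhoViaLDLT is made up, it reuses the float-based Matrix/Vector aliases from earlier, and it only covers the Rho term, since the trick applies to the (Phi * D_sigma).cwiseProduct(Phi) case.

#include <Eigen/Dense>

using Matrix = Eigen::MatrixXf;
using Vector = Eigen::VectorXf;

// Hypothetical sketch: Rho(i) = Phi.row(i) * D_sigma * Phi.row(i)^T
// computed through an LDLT decomposition of D_sigma, with the direct
// formula as a fall-back when the decomposition is not usable.
Vector rhoViaLDLT(const Matrix& Phi, const Matrix& D_sigma)
{
    const Eigen::LDLT<Matrix> ldlt(D_sigma);
    if (ldlt.info() != Eigen::Success || ldlt.vectorD().minCoeff() < 0.f)
        return (Phi * D_sigma).cwiseProduct(Phi).rowwise().sum();  // fall-back

    // L * sqrt(D): scale the columns of the unit-lower-triangular factor.
    const Matrix L = ldlt.matrixL();
    const Matrix Lsd = L * ldlt.vectorD().cwiseSqrt().asDiagonal();

    // Apply the permutation to every row (p -> p * P'), then multiply by the
    // triangular factor and take squared row norms.
    const Matrix PhiP = Phi * ldlt.transpositionsP().transpose();
    const Matrix tmp = PhiP * Lsd.triangularView<Eigen::Lower>();
    return tmp.rowwise().squaredNorm();
}

Whether this actually beats the cached PhiD.cwiseProduct(Phi) variant depends on how much of the cost sits in the small N x N factor versus the tall Phi; the sketch is only meant to make the derivation above concrete.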
