
Using FMA (fused multiply-add) instructions for complex multiplication

I'd like to leverage available fused multiply add/subtract CPU instructions to assist in complex multiplication over a decently sized array. Essentially, the basic math looks like so:

void ComplexMultiplyAddToArray(float* pDstR, float* pDstI, const float* pSrc1R, const float* pSrc1I, const float* pSrc2R, const float* pSrc2I, int len)
{
    for (int i = 0; i < len; ++i)
    {
        const float fSrc1R = pSrc1R[i];
        const float fSrc1I = pSrc1I[i];
        const float fSrc2R = pSrc2R[i];
        const float fSrc2I = pSrc2I[i];

        //  Perform complex multiplication on the input and accumulate with the output
        pDstR[i] += fSrc1R*fSrc2R - fSrc1I*fSrc2I;
        pDstI[i] += fSrc1R*fSrc2I + fSrc2R*fSrc1I;
    }
}

As you can probably see, the data is structured as separate arrays of real and imaginary parts. Now, suppose I have the following functions available as intrinsics mapping to single instructions that compute a*b+c and a*b-c respectively:

float fmadd(float a, float b, float c);
float fmsub(float a, float b, float c);

Naively, I can replace two multiplies, one add, and one subtract with one fmadd and one fmsub, like so:

//  Perform complex multiplication on the input and accumulate with the output
pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
pDstI[i] += fmadd(fSrc1R, fSrc2I, fSrc2R*fSrc1I);

This results in a very modest performance improvement, along with, I assume, better accuracy. But I think I'm missing something: the math can probably be rearranged algebraically so that I can replace a couple more multiply/add or multiply/subtract combinations. In each line there's still an extra add and an extra multiply that I feel should collapse into a single FMA, but frustratingly, I can't figure out how to do it without changing the order of operations and getting the wrong result. Any math experts with ideas?

For the sake of the question, the target platform probably isn't that important, as I'm aware these kinds of instructions exist on various platforms.

That is a good start. You can eliminate one more addition. Starting from your version:

//  Perform complex multiplication on the input and accumulate with the output
pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
pDstI[i] += fmadd(fSrc1R, fSrc2I, fSrc2R*fSrc1I);

Here you can make use of another fmadd in the calculation of the imaginary part:

pDstI[i] = fmadd(fSrc1R, fSrc2I, fmadd(fSrc2R, fSrc1I, pDstI[i]));

Likewise you can do the same with the real part, but you need to negate the argument. Whether this makes things faster or slower depends a lot on the micro-timing of the architecture you're working on:

pDstR[i] = fmsub(fSrc1R, fSrc2R, fmadd(fSrc1I, fSrc2I, -pDstR[i]));

Btw, you may get further performance improvements if you declare your destination arrays as non-aliasing by using the restrict keyword. Right now the compiler must assume that pDstR and pDstI may overlap or point to the same chunk of memory. That prevents the compiler from loading pDstI[i] before doing the write to pDstR[i].
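If the pointers really never alias, the original function could be annotated like this (a sketch using the C99 restrict qualifier; passing overlapping arrays to it would then be undefined behavior):

```c
/* restrict promises the compiler that each pointer is the only way the
 * function accesses that memory, so loads of pDstI[i] may be hoisted
 * above the store to pDstR[i], and the loop vectorizes more readily. */
void ComplexMultiplyAddToArray(float* restrict pDstR, float* restrict pDstI,
                               const float* restrict pSrc1R,
                               const float* restrict pSrc1I,
                               const float* restrict pSrc2R,
                               const float* restrict pSrc2I, int len)
{
    for (int i = 0; i < len; ++i)
    {
        const float fSrc1R = pSrc1R[i];
        const float fSrc1I = pSrc1I[i];
        const float fSrc2R = pSrc2R[i];
        const float fSrc2I = pSrc2I[i];

        /* Complex multiply, accumulated into the output */
        pDstR[i] += fSrc1R * fSrc2R - fSrc1I * fSrc2I;
        pDstI[i] += fSrc1R * fSrc2I + fSrc2R * fSrc1I;
    }
}
```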

Afterwards some careful loop-unrolling may also help if the compiler isn't already doing that. Check the assembler output of your compiler!

I've found that the following (with a little help) gives the correct answer:

pDstR[i] = fmsub(fSrc1R, fSrc2R, fmsub(fSrc1I, fSrc2I, pDstR[i]));
pDstI[i] = fmadd(fSrc1R, fSrc2I, fmadd(fSrc2R, fSrc1I, pDstI[i]));

But oddly, on AVX it doesn't improve performance as much as a hybrid that leaves the real part as a half-FMA expression while the imaginary part uses the fully fused form:

pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I);
pDstI[i] = fmadd(fSrc1R, fSrc2I, fmadd(fSrc2R, fSrc1I, pDstI[i]));

Thanks everyone for the help.
