
Is it possible to use SIMD on a serial dependency in a calculation, like an exponential moving average filter?

I'm processing multiple (independent) Exponential Moving Average 1-Pole filters on different parameters I have within my audio application, with the intent of smoothing each param value at audio rate:

for (int i = 0; i < mParams.GetSize(); i++) {
    mParams.Get(i)->SmoothBlock(blockSize);
}

...

inline void SmoothBlock(int blockSize) {
    double inputA0 = mValue * a0;

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        mSmoothedValues[sampleIndex] = z1 = inputA0 + z1 * b1;
    }
}   

I'd like to take advantage of CPU SIMD instructions, processing them in parallel, but I'm not really sure how I can achieve this.

The problem is that z1 is recursive: I can't simply "pack" an array of doubles, since each value depends on the previous one, right?

Is there perhaps a way to organize the data of the different filters so they can be processed in parallel?

Any tips or suggestions would be welcome!

Please note: I don't have several signal paths. The params represent different controls for the (single) signal being processed. Say I have a sine signal: param 1 will affect gain, param 2 pitch, param 3 filter cutoff, param 4 pan, and so on.

You have a special case where the input signal is a Heaviside step function. You want to obtain the filter's response to this function, which is called a step response. The recursion in this case can be eliminated. First, let's unroll the recursion for a few steps.

z[1] = in + z[0]*b
z[2] = in + z[1]*b = in + (in + z[0]*b)*b  = in*(1 + b) + z[0]*b^2
z[3] = in + z[2]*b = in*(1 + b + b^2) + z[0]*b^3
z[4] = in + z[3]*b = in*(1 + b + b^2 + b^3) + z[0]*b^4

From the last equation, applying the same four-step expansion to each of the first four outputs:

z[1] = in*(1 + b + b^2 + b^3) + z[-3]*b^4
z[2] = in*(1 + b + b^2 + b^3) + z[-2]*b^4
z[3] = in*(1 + b + b^2 + b^3) + z[-1]*b^4
z[4] = in*(1 + b + b^2 + b^3) + z[0]*b^4

Now it is very easy to rewrite this in vectorized form.

in' = {in, in, in, in};
z' = in' * (1 + b + b^2 + b^3) + z'*b^4

Here "'" denotes a vector, i.e. a single SIMD register. From this it is easy to translate into immintrin intrinsics. Note that you can now change the input value only at times that are multiples of four samples, not at every sample.

In addition, you can treat two or more SIMD registers as one longer vector and expand the recursion further. This will increase performance thanks to better pipeline utilization, but don't overdo it, or you will run out of registers.

If there's a closed-form formula for n steps ahead, you can use it to sidestep the serial dependency. If it can be computed with the same operations as one step, just with different coefficients, a broadcast is all you need.

As in this case: z1 = c + z1 * b, so applying that twice we get

# I'm using Z0..n as the elements in the series your loop calculates
Z2 = c + (c+Z0*b)*b
   = c + c*b + Z0*b^2

c + c*b and b^2 are both constants, if I'm understanding your code correctly that those C variables really are plain scalar variables, not pseudocode for array references. (So everything except your z1 is loop-invariant.)


So if we have a SIMD vector of 2 elements, starting with Z0 and Z1, we can step each of them forward by 2 to get Z2 and Z3.

#include <immintrin.h>  // SSE2 intrinsics

// mSmoothedValues is assumed to be the double* output buffer from your class
void SmoothBlock(int blockSize, double b, double c, double z_init) {

    // z1 = inputA0 + z1 * b1;
    __m128d zv = _mm_setr_pd(z_init, z_init*b + c);

    __m128d step2_mul = _mm_set1_pd(b*b);
    __m128d step2_add = _mm_set1_pd(c + c*b);

    for (int i = 0; i < blockSize-1; i += 2) {
        _mm_storeu_pd(mSmoothedValues + i, zv);
        zv = _mm_mul_pd(zv, step2_mul);
        zv = _mm_add_pd(zv, step2_add);
        // compile with FMA + fast-math for the compiler to fold the mul/add into one FMA
    }
    // handle the final odd element if necessary
    if (blockSize % 2 != 0)
        _mm_store_sd(mSmoothedValues + blockSize-1, zv);
}

With float + AVX (8 elements per vector), you'd have

__m256 zv = _mm256_setr_ps(z_init, c + z_init*b,
                         c + c*b + z_init*b*b,
                         c + c*b + c*b*b + z_init*b*b*b, ...);

// Z2 = c + c*b + Z0*b^2
// Z3 = c + c*b + (c + Z0*b) * b^2
// Z3 = c + c*b + c*b^2 + Z0*b^3

and the add/mul factors would be for 8 steps.

Normally people use float for SIMD because you get twice as many elements per vector, and half the memory bandwidth / cache footprint, so you typically get a factor-of-2 speedup over double. (Same number of vectors / bytes per clock.)


On a Haswell or Sandybridge CPU, for example, the above loop will run at one vector per 8 cycles, bottlenecked on the latency of mulpd (5 cycles) + addpd (3 cycles). We generate 2 double results per vector, but that's still a huge bottleneck compared to the 1 mul and 1 add per clock throughput. We're missing out on a factor of 8 in throughput.

(Or, if compiled to a single FMA instead of mul->add, the latency is 5 cycles.)

Sidestepping the serial dependency is useful not just for SIMD: avoiding bottlenecks on FP add/mul (or FMA) latency gives further speedups, up to the ratio of FP add+mul latency to add+mul throughput.

Simply unroll more, and use multiple vectors, like zv0 , zv1 , zv2 , zv3 . This also increases the number of steps you take at once. For example, 16-byte vectors of float , with 4 vectors, would be 4x4 = 16 steps.
