C# multiplying array elements using System.Numerics

Question

I'm experimenting with System.Numerics to multiply array elements. Is there a faster way of multiplying the elements of the resultant vector (accVector) together? Currently accVector needs to be converted to an array, whose elements are then multiplied together using LINQ.

        private double VectorMultiplication(double[] array)
        {
            int vectorSize = Vector<double>.Count;
            var accVector = Vector<double>.One;
            int i;

            for (i = 0; i <= array.Length - vectorSize; i += vectorSize)
            {
                var v = new Vector<double>(array, i);
                accVector = Vector.Multiply(accVector, v);
            }

            var tempArray = new double[Vector<double>.Count];
            accVector.CopyTo(tempArray);
            var result = tempArray.Aggregate(1d, (p, d) => p * d);

            for (; i < array.Length; i++)
            {
                result *= array[i];
            }
            return result;
        }

Answer 1

Is there a faster way of multiplying the element of the resultant vector (accVector) together?

Within System.Numerics, no. As mentioned by Peter in the comments, usually you would start by splitting a 256-bit vector into two 128-bit halves and multiplying them, then use shuffles to handle the 128-bit part. But System.Numerics offers no shuffles, and it does not let you choose the size of the vector that you're using.

The usual approach can be used with the System.Runtime.Intrinsics.X86 API, which requires .NET Core 3.0 or higher.

For example:

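// requires: using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
static double product(Vector256<double> vec)
{
    // multiply the lower and upper 128-bit halves lane by lane
    var t = Sse2.Multiply(vec.GetLower(), vec.GetUpper());
    // multiply the two remaining lanes together
    return t.GetElement(0) * t.GetElement(1);
}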

That looks like it might be bad, leaving a mysterious GetElement up to the JIT engine to figure out, but actually the codegen is really reasonable:

 vmovupd     ymm0,ymmword ptr [rcx] 
 vextractf128 xmm0,ymm0,1  
 vmovupd     ymm1,ymmword ptr [rcx]  
 vmulpd      xmm0,xmm1,xmm0  
 vmovaps     xmm1,xmm0  
 vpshufd     xmm0,xmm0,0EEh  
 vmulsd      xmm0,xmm0,xmm1

So it looks like GetElement(0) is implicit and GetElement(1) results in a vpshufd, which is fine. Copying xmm0 to xmm1 instead of using a non-destructive vpshufd is a bit mysterious but not that bad; overall it's better than I normally expect of .NET. I tested this function non-inlined; usually it should be inlined and the loads should go away.
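To show where such a helper would sit, here is a minimal sketch of a complete product routine built on the same API. The class name ProductDemo, the method names, and the scalar fallback are illustrative assumptions rather than part of the answer. It assumes unsafe code is enabled for the project and .NET Core 3.0 or higher, and it has not been benchmarked.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ProductDemo // hypothetical name, for illustration only
{
    // Horizontal product of the four lanes, same approach as the helper above.
    static double HorizontalProduct(Vector256<double> vec)
    {
        var t = Sse2.Multiply(vec.GetLower(), vec.GetUpper());
        return t.GetElement(0) * t.GetElement(1);
    }

    public static unsafe double Product(double[] array)
    {
        double result = 1.0;
        int i = 0;

        if (Avx.IsSupported)
        {
            int step = Vector256<double>.Count; // 4 doubles per 256-bit vector
            var acc = Vector256.Create(1.0);
            fixed (double* p = array)
            {
                // Multiply full vectors into a single accumulator.
                for (; i <= array.Length - step; i += step)
                    acc = Avx.Multiply(acc, Avx.LoadVector256(p + i));
            }
            result = HorizontalProduct(acc);
        }

        // Scalar tail; this is also the whole computation when AVX is unavailable.
        for (; i < array.Length; i++)
            result *= array[i];

        return result;
    }
}
```

Called as ProductDemo.Product(new double[] { 1, 2, 3, 4, 5 }), this returns 120. The sketch keeps a single accumulator for brevity, so it still has the latency-bound dependency chain.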

The main loop can be improved, because the throughput of multiplication is much better than its latency. Right now the multiplications are done one at a time (that is, one vector multiplication at a time), with a delay in between (5 cycles on Haswell, 4 on Broadwell and newer) to wait for the previous multiplication to finish. An Intel Haswell, for example, could be starting two multiplications per cycle, which is 10 times as many. Realistically the improvement wouldn't be that big, but creating some opportunity for instruction-level parallelism helps.

For example (not tested):

var acc0 = Vector<double>.One;
var acc1 = Vector<double>.One;
var acc2 = Vector<double>.One;
var acc3 = Vector<double>.One;
var acc4 = Vector<double>.One;
var acc5 = Vector<double>.One;
var acc6 = Vector<double>.One;
var acc7 = Vector<double>.One;
int i;

for (i = 0; i <= array.Length - vectorSize * 8; i += vectorSize * 8)
{
    acc0 = Vector.Multiply(acc0, new Vector<double>(array, i));
    acc1 = Vector.Multiply(acc1, new Vector<double>(array, i + vectorSize));
    acc2 = Vector.Multiply(acc2, new Vector<double>(array, i + vectorSize * 2));
    acc3 = Vector.Multiply(acc3, new Vector<double>(array, i + vectorSize * 3));
    acc4 = Vector.Multiply(acc4, new Vector<double>(array, i + vectorSize * 4));
    acc5 = Vector.Multiply(acc5, new Vector<double>(array, i + vectorSize * 5));
    acc6 = Vector.Multiply(acc6, new Vector<double>(array, i + vectorSize * 6));
    acc7 = Vector.Multiply(acc7, new Vector<double>(array, i + vectorSize * 7));
}
acc0 = Vector.Multiply(acc0, acc1);
acc2 = Vector.Multiply(acc2, acc3);
acc4 = Vector.Multiply(acc4, acc5);
acc6 = Vector.Multiply(acc6, acc7);
acc0 = Vector.Multiply(acc0, acc2);
acc4 = Vector.Multiply(acc4, acc6);
acc0 = Vector.Multiply(acc0, acc4);
// from here on it's the same
var tempArray = new double[Vector<double>.Count];
acc0.CopyTo(tempArray);
var result = tempArray.Aggregate(1d, (p, d) => p * d);
for (; i < array.Length; i++)
    result *= array[i];

This makes that last loop run potentially 8 times as many iterations as it used to. That could be avoided by having an extra single-vector-per-iteration loop.
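The extra loop could look roughly like the following sketch, which is untested against real workloads. The class name UnrolledProduct is hypothetical, and four accumulators are used instead of eight only to keep the example short; the structure is otherwise the same as the code above.

```csharp
using System;
using System.Linq;
using System.Numerics;

static class UnrolledProduct // hypothetical name, for illustration only
{
    public static double Product(double[] array)
    {
        int vectorSize = Vector<double>.Count;
        var acc0 = Vector<double>.One;
        var acc1 = Vector<double>.One;
        var acc2 = Vector<double>.One;
        var acc3 = Vector<double>.One;
        int i;

        // Main loop: four independent multiplication chains per iteration.
        for (i = 0; i <= array.Length - vectorSize * 4; i += vectorSize * 4)
        {
            acc0 = Vector.Multiply(acc0, new Vector<double>(array, i));
            acc1 = Vector.Multiply(acc1, new Vector<double>(array, i + vectorSize));
            acc2 = Vector.Multiply(acc2, new Vector<double>(array, i + vectorSize * 2));
            acc3 = Vector.Multiply(acc3, new Vector<double>(array, i + vectorSize * 3));
        }

        // Extra loop: one vector per iteration, so that at most
        // vectorSize - 1 elements are left for the scalar loop.
        for (; i <= array.Length - vectorSize; i += vectorSize)
            acc0 = Vector.Multiply(acc0, new Vector<double>(array, i));

        // Combine the accumulators into one vector.
        acc0 = Vector.Multiply(Vector.Multiply(acc0, acc1), Vector.Multiply(acc2, acc3));

        // Same final reduction as in the question.
        var tempArray = new double[vectorSize];
        acc0.CopyTo(tempArray);
        var result = tempArray.Aggregate(1d, (p, d) => p * d);

        // Scalar tail for the remaining elements.
        for (; i < array.Length; i++)
            result *= array[i];

        return result;
    }
}
```

Because floating-point multiplication is not associative, the result can differ in the last bits from a strictly left-to-right scalar product; that applies to every vectorized version here, not just this one.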

solution1
2 2020-02-10 09:14:45
