I'm experimenting with System.Numerics to multiply array elements. Is there a faster way of multiplying the elements of the resultant vector (accVector) together? Currently accVector needs to be converted to an array, whose elements are then multiplied together using LINQ.
private double VectorMultiplication(double[] array)
{
    int vectorSize = Vector<double>.Count;
    var accVector = Vector<double>.One;
    int i;
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<double>(array, i);
        accVector = Vector.Multiply(accVector, v);
    }
    var tempArray = new double[Vector<double>.Count];
    accVector.CopyTo(tempArray);
    var result = tempArray.Aggregate(1d, (p, d) => p * d);
    for (; i < array.Length; i++)
    {
        result *= array[i];
    }
    return result;
}
Is there a faster way of multiplying the element of the resultant vector (accVector) together?
Within System.Numerics, no. As mentioned by Peter in the comments, usually you would start by splitting a 256-bit vector into two 128-bit halves and multiplying them, then use shuffles to handle the 128-bit part. But System.Numerics offers no shuffles, and it does not let you choose the size of the vector that you're using.
The usual approach can be used with the System.Runtime.Intrinsics.X86 API, which requires .NET Core 3.0 or higher.
For example:
static double product(Vector256<double> vec)
{
    var t = Sse2.Multiply(vec.GetLower(), vec.GetUpper());
    return t.GetElement(0) * t.GetElement(1);
}
That looks like it might be bad, leaving a mysterious GetElement up to the JIT engine to figure out, but actually the codegen is really reasonable:
vmovupd ymm0,ymmword ptr [rcx]
vextractf128 xmm0,ymm0,1
vmovupd ymm1,ymmword ptr [rcx]
vmulpd xmm0,xmm1,xmm0
vmovaps xmm1,xmm0
vpshufd xmm0,xmm0,0EEh
vmulsd xmm0,xmm0,xmm1
So it looks like GetElement(0) is implicit and GetElement(1) results in a vpshufd, which is fine. Copying xmm0 to xmm1 instead of using a non-destructive vpshufd is a bit mysterious but not that bad; overall it is better than I normally expect of .NET. I tested this function non-inlined. Usually it should be inlined, and the loads should then go away.
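To show where that horizontal product fits, here is a minimal, untested-on-your-hardware sketch of a complete routine built on Vector256<double>. The names Vector256Product, Product and HorizontalProduct are made up for illustration, and using MemoryMarshal.Cast to view the array as vectors is just one way to avoid unsafe code, not something the approach requires. It assumes .NET Core 3.0 or higher and falls back to a plain scalar loop when AVX is unavailable.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector256Product // hypothetical name, for illustration only
{
    // Same idea as the product function above: multiply the two 128-bit
    // halves together, then multiply the two remaining elements.
    static double HorizontalProduct(Vector256<double> vec)
    {
        Vector128<double> t = Sse2.Multiply(vec.GetLower(), vec.GetUpper());
        return t.GetElement(0) * t.GetElement(1);
    }

    public static double Product(double[] array)
    {
        double result = 1.0;
        int i = 0;
        if (Avx.IsSupported)
        {
            // View the array as whole 256-bit blocks; leftover elements
            // that do not fill a block are handled by the scalar loop below.
            ReadOnlySpan<double> data = array;
            ReadOnlySpan<Vector256<double>> blocks =
                MemoryMarshal.Cast<double, Vector256<double>>(data);
            Vector256<double> acc = Vector256.Create(1.0);
            foreach (Vector256<double> block in blocks)
                acc = Avx.Multiply(acc, block);
            result = HorizontalProduct(acc);
            i = blocks.Length * Vector256<double>.Count;
        }
        // Scalar tail (or the whole array when AVX is not supported).
        for (; i < array.Length; i++)
            result *= array[i];
        return result;
    }
}
```

This still uses a single accumulator, so the multiplications in its main loop remain serialized on one dependency chain.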
The main loop can be improved, because the throughput of multiplication is much better than its latency. Right now the multiplications are done one at a time (that is, one vector multiplication at a time), each waiting for the previous multiplication to finish (a latency of 5 cycles on Haswell, 3 on Broadwell, and 4 on Skylake). An Intel Haswell, for example, could be starting two multiplications per cycle, which is 10 times as many. Realistically the improvement wouldn't be that big, but creating some opportunity for instruction-level parallelism helps.
For example (not tested):
var acc0 = Vector<double>.One;
var acc1 = Vector<double>.One;
var acc2 = Vector<double>.One;
var acc3 = Vector<double>.One;
var acc4 = Vector<double>.One;
var acc5 = Vector<double>.One;
var acc6 = Vector<double>.One;
var acc7 = Vector<double>.One;
int i;
for (i = 0; i <= array.Length - vectorSize * 8; i += vectorSize * 8)
{
    acc0 = Vector.Multiply(acc0, new Vector<double>(array, i));
    acc1 = Vector.Multiply(acc1, new Vector<double>(array, i + vectorSize));
    acc2 = Vector.Multiply(acc2, new Vector<double>(array, i + vectorSize * 2));
    acc3 = Vector.Multiply(acc3, new Vector<double>(array, i + vectorSize * 3));
    acc4 = Vector.Multiply(acc4, new Vector<double>(array, i + vectorSize * 4));
    acc5 = Vector.Multiply(acc5, new Vector<double>(array, i + vectorSize * 5));
    acc6 = Vector.Multiply(acc6, new Vector<double>(array, i + vectorSize * 6));
    acc7 = Vector.Multiply(acc7, new Vector<double>(array, i + vectorSize * 7));
}
acc0 = Vector.Multiply(acc0, acc1);
acc2 = Vector.Multiply(acc2, acc3);
acc4 = Vector.Multiply(acc4, acc5);
acc6 = Vector.Multiply(acc6, acc7);
acc0 = Vector.Multiply(acc0, acc2);
acc4 = Vector.Multiply(acc4, acc6);
acc0 = Vector.Multiply(acc0, acc4);
// from here on it's the same
var tempArray = new double[Vector<double>.Count];
acc0.CopyTo(tempArray);
var result = tempArray.Aggregate(1d, (p, d) => p * d);
for (; i < array.Length; i++)
    result *= array[i];
This makes the last (scalar) loop run for potentially 8 times as many iterations as it used to. That could be avoided by adding an extra loop that processes a single vector per iteration.
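As a rough sketch of that extra single-vector loop, the version below runs the unrolled loop first, then a one-vector-per-iteration loop, then the scalar tail. To keep it short it uses 4 accumulators instead of 8, and it reads the final elements through the Vector<double> indexer rather than CopyTo plus LINQ; both are choices made here for illustration. The names UnrolledProduct and Product are hypothetical, and none of this has been benchmarked.

```csharp
using System;
using System.Numerics;

static class UnrolledProduct // hypothetical name, for illustration only
{
    public static double Product(double[] array)
    {
        int vectorSize = Vector<double>.Count;
        var acc0 = Vector<double>.One;
        var acc1 = Vector<double>.One;
        var acc2 = Vector<double>.One;
        var acc3 = Vector<double>.One;
        int i = 0;

        // Unrolled loop: four independent dependency chains.
        for (; i <= array.Length - vectorSize * 4; i += vectorSize * 4)
        {
            acc0 *= new Vector<double>(array, i);
            acc1 *= new Vector<double>(array, i + vectorSize);
            acc2 *= new Vector<double>(array, i + vectorSize * 2);
            acc3 *= new Vector<double>(array, i + vectorSize * 3);
        }
        acc0 = (acc0 * acc1) * (acc2 * acc3);

        // Extra loop: one vector per iteration, so the scalar tail
        // below handles fewer than vectorSize elements.
        for (; i <= array.Length - vectorSize; i += vectorSize)
            acc0 *= new Vector<double>(array, i);

        // Horizontal product of the accumulator, without allocating.
        double result = 1.0;
        for (int j = 0; j < vectorSize; j++)
            result *= acc0[j];

        // Scalar tail.
        for (; i < array.Length; i++)
            result *= array[i];
        return result;
    }
}
```

With this layout the scalar loop never sees more than vectorSize - 1 elements, regardless of how far the main loop is unrolled.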