I have three cases to test the relative performance of classes, classes with inheritance, and structs. These are to be used in tight loops, so performance counts. Dot products are used as part of many algorithms in 2D and 3D geometry, and I have run the profiler on real code. The tests below are indicative of real-world performance problems I have seen.
The results for 10,000,000 iterations of the loop (the value of n in the code below), each applying the dot product, are
ControlA 208 ms (class with inheritance)
ControlB 201 ms (class with no inheritance)
ControlC 85 ms (struct)
The tests were run without a debugger attached and with optimizations turned on. My question is: what is it about classes in this case that causes them to be so slow?
I presumed the JIT would still be able to inline all the calls, class or struct, so in effect the results should be identical. Note that if I disable optimizations then my results are identical.
ControlA 3239 ms
ControlB 3228 ms
ControlC 3213 ms
They are always within 20ms of each other if the test is re-run.
using System;
using System.Diagnostics;

// Class with inheritance: X and Y live in the base class.
public class PointControlA
{
    public double X { get; set; }
    public double Y { get; set; }

    public PointControlA(double x, double y)
    {
        X = x;
        Y = y;
    }
}

public class Point3ControlA : PointControlA
{
    public double Z { get; set; }

    public Point3ControlA(double x, double y, double z) : base(x, y)
    {
        Z = z;
    }

    public static double Dot(Point3ControlA a, Point3ControlA b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

// Class with no inheritance.
public class Point3ControlB
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3ControlB(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlB a, Point3ControlB b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

// Struct.
public struct Point3ControlC
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3ControlC(double x, double y, double z) : this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlC a, Point3ControlC b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public class Program
{
    public static void TestStructClass()
    {
        var vControlA = new Point3ControlA(11, 12, 13);
        var vControlB = new Point3ControlB(11, 12, 13);
        var vControlC = new Point3ControlC(11, 12, 13);
        var n = 10000000;

        // Note: acc is accumulated but never read after each loop.
        double acc = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlA.Dot(vControlA, vControlA);
        }
        Console.WriteLine("ControlA " + sw.ElapsedMilliseconds);

        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlB.Dot(vControlB, vControlB);
        }
        Console.WriteLine("ControlB " + sw.ElapsedMilliseconds);

        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlC.Dot(vControlC, vControlC);
        }
        Console.WriteLine("ControlC " + sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        TestStructClass();
    }
}
This dotnet fiddle is proof of compilation only. It does not show the performance differences.
I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea. I now have the test case to prove it but I can't understand why.
NOTE : I have tried to set a breakpoint in the debugger with JIT optimizations turned on but the debugger will not break. Looking at the IL with JIT optimizations turned off doesn't tell me anything.
After the answer by @pkuderov I took his code and played with it. I changed the code and found that if I forced inlining via
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double Dot(Point3Class a)
{
return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
}
the difference between the struct and the class for the dot product vanished. Why the attribute is not needed on some setups but was needed on mine is not clear. However, I did not give up. There is still a performance problem with the vendor code, and I think the dot product is not the best example.
I modified @pkuderov's code to implement Vector Add
which will create new instances of the structs and classes. The results are here
https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48
In the example I also modified the code to pick a pseudo-random vector from an array, to avoid the problem of the instances sticking in the registers (I hope).
The results show that:
DotProduct performance is identical, or maybe faster, for classes.
Vector Add, and I assume anything that creates a new object, is slower for classes.
Add class/class 2777 ms
Add struct/struct 2457 ms
DotProd class/class 1909 ms
DotProd struct/struct 2108 ms
The full code and results are here if anybody wants to try it out.
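For readers who cannot open the gist, here is a minimal sketch of what such a Vector Add comparison might look like. The type names Point3Struct and Point3Class and the (i + 1) % m indexing follow the post, but the bodies, the AddDemo helper and the sample data are my assumptions, not the code from the gist. The point it illustrates is that the class version of Add has to allocate a new heap object on every call, while the struct version just returns three doubles by value.

```csharp
using System;

// Assumed shape of the struct from the post (fields instead of properties for brevity).
public struct Point3Struct
{
    public readonly double X, Y, Z;
    public Point3Struct(double x, double y, double z) { X = x; Y = y; Z = z; }

    // Returns the sum by value; no heap allocation is required.
    public static Point3Struct Add(Point3Struct a, Point3Struct b)
        => new Point3Struct(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
}

// Assumed shape of the class from the post.
public sealed class Point3Class
{
    public readonly double X, Y, Z;
    public Point3Class(double x, double y, double z) { X = x; Y = y; Z = z; }

    // Every call creates a new object on the heap.
    public static Point3Class Add(Point3Class a, Point3Class b)
        => new Point3Class(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
}

// Hypothetical helper, not part of the original gist.
public static class AddDemo
{
    public static Point3Struct SumStructs(Point3Struct[] points, int n)
    {
        int m = points.Length;
        var acc = new Point3Struct(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Struct.Add(acc, points[(i + 1) % m]);
        return acc;
    }

    public static Point3Class SumClasses(Point3Class[] points, int n)
    {
        int m = points.Length;
        var acc = new Point3Class(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Class.Add(acc, points[(i + 1) % m]);
        return acc;
    }

    public static void Main()
    {
        var s = new[] { new Point3Struct(1, 0, 0), new Point3Struct(0, 1, 0) };
        var c = new[] { new Point3Class(1, 0, 0), new Point3Class(0, 1, 0) };

        // Returning and printing the accumulators keeps the work observable.
        Point3Struct rs = SumStructs(s, 4);
        Point3Class rc = SumClasses(c, 4);
        Console.WriteLine(rs.X + " " + rs.Y + " " + rs.Z);
        Console.WriteLine(rc.X + " " + rc.Y + " " + rc.Z);
    }
}
```

Both loops compute the same sums; only the storage of the accumulator differs, which is the thing being compared.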
For the vector add example where an array of vectors is summed together the struct version keeps the accumulator in 3 registers
var accStruct = new Point3Struct(0, 0, 0);
for (int i = 0; i < n; i++)
accStruct = Point3Struct.Add(accStruct, pointStruct[(i + 1) % m]);
the asm body is
// load the next vector into a register
00007FFA3CA2240E vmovsd xmm3,qword ptr [rax]
00007FFA3CA22413 vmovsd xmm4,qword ptr [rax+8]
00007FFA3CA22419 vmovsd xmm5,qword ptr [rax+10h]
// Sum the accumulator (the accumulator stays in the registers )
00007FFA3CA2241F vaddsd xmm0,xmm0,xmm3
00007FFA3CA22424 vaddsd xmm1,xmm1,xmm4
00007FFA3CA22429 vaddsd xmm2,xmm2,xmm5
but the class-based version reads the accumulator from main memory and writes it back out on every iteration, which is inefficient
var accPC = new Point3Class(0, 0, 0);
for (int i = 0; i < n; i++)
accPC = Point3Class.Add(accPC, pointClass[(i + 1) % m]);
the asm body is
// Read and add both accumulator X and Xnext from main memory
00007FFA3CA2224A vmovsd xmm0,qword ptr [r14+8]
00007FFA3CA22250 vmovaps xmm7,xmm0
00007FFA3CA22255 vaddsd xmm7,xmm7,mmword ptr [r12+8]
// Read and add both accumulator Y and Ynext from main memory
00007FFA3CA2225C vmovsd xmm0,qword ptr [r14+10h]
00007FFA3CA22262 vmovaps xmm8,xmm0
00007FFA3CA22267 vaddsd xmm8,xmm8,mmword ptr [r12+10h]
// Read and add both accumulator Z and Znext from main memory
00007FFA3CA2226E vmovsd xmm9,qword ptr [r14+18h]
00007FFA3CA22283 vmovaps xmm0,xmm9
00007FFA3CA22288 vaddsd xmm0,xmm0,mmword ptr [r12+18h]
// Move accumulator X,Y,Z back to main memory.
00007FFA3CA2228F vmovsd qword ptr [rax+8],xmm7
00007FFA3CA22295 vmovsd qword ptr [rax+10h],xmm8
00007FFA3CA2229B vmovsd qword ptr [rax+18h],xmm0
Update
After spending some time thinking about the problem, I think I agree with @DavidHaim that memory jump overhead is not the issue here, because of caching.
Also, I've added more options to your tests (and removed the first one, with inheritance). So I have:
Dot(cl, cl) - the initial method
Dot(cl) - which is the "square product"
Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z), aka Dot(cl.xyz) - pass fields
Dot(st, st) - initial
Dot(st) - square product
Dot(st.X, st.Y, st.Z, st.X, st.Y, st.Z), aka Dot(st.xyz) - pass fields
Dot(st6) - wanted to check whether the size of the struct matters
Dot(x, y, z, x, y, z), aka Dot(xyz) - just local const double variables
Result times are:
...And I'm not really sure why I see these results.
Maybe for plain primitive types the compiler does more aggressive pass-by-register optimizations; maybe it is more certain of lifetime boundaries or constness, which again allows more aggressive optimizations. Maybe some kind of loop unrolling.
I think my expertise is just not enough :) But still, my results contradict yours.
The full test code, with results on my machine and the generated IL code, can be found here.
In C#, classes are reference types and structs are value types. One major effect is that value types can be (and most of the time are!) allocated on the stack, while reference types are always allocated on the heap.
So every time you access the inner state of a reference-type variable, you need to dereference a pointer to memory on the heap (it's a kind of jump), while a value type is already on the stack or even optimized out to registers.
I think you see a difference because of this.
PS: by "most of the time are" I was alluding to boxing, a technique that places value-type objects on the heap (e.g. to cast a value type to an interface or for dynamic method call binding).
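As a small illustration of the boxing mentioned in the PS, here is a sketch with hypothetical names (IHasLength, Vec2 and BoxingDemo are not from the question). Converting a struct to object or to an interface type copies it into a new heap object, so later changes to the original variable do not affect the boxed copy.

```csharp
using System;

// Hypothetical interface and struct, used only to demonstrate boxing.
public interface IHasLength
{
    double LengthSquared();
}

public struct Vec2 : IHasLength
{
    public double X, Y;
    public Vec2(double x, double y) { X = x; Y = y; }
    public double LengthSquared() => X * X + Y * Y;
}

public static class BoxingDemo
{
    public static bool BoxedCopyIsIndependent()
    {
        var a = new Vec2(1, 2);
        object boxed = a;          // boxing: the value is copied into a heap object
        a.X = 100;                 // modifies only the local variable
        var unboxed = (Vec2)boxed; // unboxing: copies the value back out
        return unboxed.X == 1;
    }

    // Passing a struct where an interface is expected boxes it at the call site.
    public static double ViaInterface(IHasLength v) => v.LengthSquared();

    public static void Main()
    {
        Console.WriteLine(BoxedCopyIsIndependent());     // prints True
        Console.WriteLine(ViaInterface(new Vec2(3, 4))); // prints 25
    }
}
```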
As I thought, this test doesn't prove much.
TLDR: the compiler completely optimizes away the call to Point3ControlC.Dot while preserving the calls to the other two. The difference is not because structs are faster in this case, but because the entire calculation is skipped.
My settings:
The generated assembly for
for (int i = 0; i < n; i++)
{
acc += Point3ControlA.Dot(vControlA, vControlA);
}
is:
00DC0573 xor edx,edx // temp = 0
00DC0575 mov dword ptr [ebp-10h],edx // i = temp
00DC0578 mov ecx,edi // load vControlA as the first argument
00DC057A mov edx,edi // load vControlA as the second argument
00DC057C call dword ptr ds:[0BA4F0Ch] // call Point3ControlA.Dot
00DC0582 fstp st(0) // pop the FPU stack, discarding the returned value
00DC0584 inc dword ptr [ebp-10h] // i++
00DC0587 cmp dword ptr [ebp-10h],989680h // compare i with n (989680h = 10,000,000)
00DC058E jl 00DC0578 // if i < n, jump to the beginning of the loop
Afterthoughts:
The JIT compiler for some reason did not use a register for i, so it incremented an integer on the stack (ebp-10h) instead. As a result, this test has the poorest performance.
Moving on to the second test:
for (int i = 0; i < n; i++)
{
acc += Point3ControlB.Dot(vControlB, vControlB);
}
Generated assembly:
00DC0612 xor edi,edi // i = 0
00DC0614 mov ecx,esi // load vControlB as the first argument
00DC0616 mov edx,esi // load vControlB as the second argument
00DC0618 call dword ptr ds:[0BA4FD4h] // call Point3ControlB.Dot
00DC061E fstp st(0) // pop the FPU stack, discarding the returned value
00DC0620 inc edi // ++i
00DC0621 cmp edi,989680h // compare i with n
00DC0627 jl 00DC0614 // if i < n, jump to the beginning of the loop
Afterthoughts: this generated assembly is almost identical to the first one, but this time the JIT did use a register for i, hence the minor performance boost over the first test.
Moving on to the test in question:
for (int i = 0; i < n; i++)
{
acc += Point3ControlC.Dot(vControlC, vControlC);
}
And for the generated assembly:
00DC06A7 xor eax,eax //i = 0
00DC06A9 inc eax //++i
00DC06AA cmp eax,989680h //does i == n ?
00DC06AF jl 00DC06A9 //if not, jump to the beginning of the loop
As we can see, the JIT has completely optimized away the call to Point3ControlC.Dot, so you actually pay only for the loop, not for the call itself. Hence this "test" finishes first, as it didn't do much to begin with.
Can we say something about structs vs classes from this test alone? Well, no. I'm still not quite sure why the compiler decided to optimize out the call to the struct function while preserving the other calls. What I am sure about is that in real-life code, the compiler cannot optimize the call away if the result is used. In this mini-benchmark we don't do much with the result, and even if we did, the compiler could calculate the result at compile time. So the compiler can be more aggressive here than it could be in real-life code.
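One way to act on this is to make sure the benchmark actually consumes the result and does not feed the loop a single constant input. The following is only a sketch under those assumptions: PointS, PointC and DotBench are hypothetical names, the inputs are picked from a small array, and the accumulated value is returned and printed so the JIT cannot treat the calls as dead code. It makes no claim about which version will be faster on a given machine or runtime.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-ins for the struct and class under test.
public struct PointS
{
    public double X, Y, Z;
    public PointS(double x, double y, double z) { X = x; Y = y; Z = z; }
    public static double Dot(PointS a, PointS b) => a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}

public sealed class PointC
{
    public double X, Y, Z;
    public PointC(double x, double y, double z) { X = x; Y = y; Z = z; }
    public static double Dot(PointC a, PointC b) => a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}

public static class DotBench
{
    // The accumulated sum is returned, so the calls cannot be discarded.
    public static double SumStruct(PointS[] pts, int n)
    {
        double acc = 0;
        for (int i = 0; i < n; i++)
        {
            PointS p = pts[i % pts.Length];
            acc += PointS.Dot(p, p);
        }
        return acc;
    }

    public static double SumClass(PointC[] pts, int n)
    {
        double acc = 0;
        for (int i = 0; i < n; i++)
        {
            PointC p = pts[i % pts.Length];
            acc += PointC.Dot(p, p);
        }
        return acc;
    }

    public static void Main()
    {
        var s = new[] { new PointS(1, 2, 3), new PointS(0, 1, 0) };
        var c = new[] { new PointC(1, 2, 3), new PointC(0, 1, 0) };
        const int n = 10000000;

        // Warm-up calls so JIT compilation is not included in the timing.
        SumStruct(s, 1000);
        SumClass(c, 1000);

        var sw = Stopwatch.StartNew();
        double rs = SumStruct(s, n);
        Console.WriteLine("struct " + sw.ElapsedMilliseconds + " ms, acc = " + rs);

        sw.Restart();
        double rc = SumClass(c, n);
        Console.WriteLine("class  " + sw.ElapsedMilliseconds + " ms, acc = " + rc);
    }
}
```

The modulo indexing adds a small fixed cost to both loops; it is there only to keep the inputs from being a single compile-time constant.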