Why are structs so much faster than classes for this specific case?

Question

I have three cases to test the relative performance of classes, classes with inheritance, and structs. They are meant for tight loops, so performance counts. Dot products are used in many 2D and 3D geometry algorithms, and I have run the profiler on real code. The tests below are indicative of real-world performance problems I have seen.

Running the loop 100000000 times, applying the dot product on each iteration, gives:

ControlA 208 ms   ( class with inheritance )
ControlB 201 ms   ( class with no inheritance )
ControlC 85  ms   ( struct )

The tests were run without the debugger attached and with optimizations turned on. My question is: what is it about classes in this case that causes them to be so slow?

I presumed the JIT would still be able to inline all the calls, class or struct, so in effect the results should be identical. Note that if I disable optimizations, the results are nearly identical:

ControlA 3239 ms
ControlB 3228 ms
ControlC 3213 ms

They are always within 20ms of each other if the test is re-run.

The classes under investigation

using System;
using System.Diagnostics;

public class PointControlA
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public PointControlA(double x, double y)
    {
        X = x;
        Y = y;
    }
}

public class Point3ControlA : PointControlA
{
    public double Z
    {
        get;
        set;
    }

    public Point3ControlA(double x, double y, double z): base (x, y)
    {
        Z = z;
    }

    public static double Dot(Point3ControlA a, Point3ControlA b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public class Point3ControlB
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public double Z
    {
        get;
        set;
    }

    public Point3ControlB(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlB a, Point3ControlB b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public struct Point3ControlC
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public double Z
    {
        get;
        set;
    }

    public Point3ControlC(double x, double y, double z):this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlC a, Point3ControlC b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

Test Script

public class Program
{
    public static void TestStructClass()
    {
        var vControlA = new Point3ControlA(11, 12, 13);
        var vControlB = new Point3ControlB(11, 12, 13);
        var vControlC = new Point3ControlC(11, 12, 13);
        var sw = Stopwatch.StartNew();
        var n = 10000000;
        double acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlA.Dot(vControlA, vControlA);
        }

        Console.WriteLine("ControlA " + sw.ElapsedMilliseconds);
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlB.Dot(vControlB, vControlB);
        }

        Console.WriteLine("ControlB " + sw.ElapsedMilliseconds);
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlC.Dot(vControlC, vControlC);
        }

        Console.WriteLine("ControlC " + sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        TestStructClass();
    }
}

This dotnet fiddle is proof of compilation only. It does not show the performance differences.

I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea. I now have the test case to prove it but I can't understand why.

NOTE : I have tried to set a breakpoint in the debugger with JIT optimizations turned on but the debugger will not break. Looking at the IL with JIT optimizations turned off doesn't tell me anything.

EDIT

After the answer by @pkuderov I took his code and played with it. I changed the code and found that if I forced inlining via

    // requires: using System.Runtime.CompilerServices;
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }

the difference between the struct and the class for the dot product vanished. Why the attribute is not needed with some setups but was needed for mine is not clear. However, I did not give up. There is still a performance problem with the vendor code, and I think the dot product is not the best example.
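For anyone who wants to try the attribute in isolation, here is a minimal self-contained sketch. The Point3Class below is a stand-in I am assuming has the same shape as the class in @pkuderov's code; it is not copied from it. Note that AggressiveInlining is a hint to the JIT, not a guarantee.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for the Point3Class used in the test code.
public class Point3Class
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3Class(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    // Asks the JIT to inline this method where it can; this is a hint only.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }
}

public static class InliningDemo
{
    public static void Main()
    {
        var p = new Point3Class(11, 12, 13);
        // 11*11 + 12*12 + 13*13 = 434
        Console.WriteLine(Point3Class.Dot(p));
    }
}
```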

I modified @pkuderov's code to implement Vector Add which will create new instances of the structs and classes. The results are here

https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48

In the example I also modified the code to pick a pseudo-random vector from an array to avoid the problem of the instances sticking in the registers (I hope).

The results show that:

- DotProduct performance is identical, or maybe even faster, for classes.
- Vector Add, and I assume anything that creates a new object, is slower for classes.

Add class/class         2777ms
Add struct/struct       2457ms
DotProd class/class     1909ms
DotProd struct/struct   2108ms

The full code and results are here if anybody wants to try it out.
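For readers without the gist open, the two Add variants are roughly of the following shape. This is a minimal sketch, not a copy of the gist: the type names follow the loops shown in this post, and the member layout is assumed to mirror the Point3Control types above.

```csharp
using System;

// Assumed shape of the struct vector (illustrative, not the gist's exact code).
public struct Point3Struct
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3Struct(double x, double y, double z) : this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    // Returns a new value; the result does not need a heap allocation.
    public static Point3Struct Add(Point3Struct a, Point3Struct b)
    {
        return new Point3Struct(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

// Assumed shape of the class vector (illustrative, not the gist's exact code).
public class Point3Class
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3Class(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    // Returns a new object, which normally means a managed heap allocation per call.
    public static Point3Class Add(Point3Class a, Point3Class b)
    {
        return new Point3Class(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

public static class AddDemo
{
    public static void Main()
    {
        var s = Point3Struct.Add(new Point3Struct(1, 2, 3), new Point3Struct(4, 5, 6));
        var c = Point3Class.Add(new Point3Class(1, 2, 3), new Point3Class(4, 5, 6));
        Console.WriteLine(s.X + " " + s.Y + " " + s.Z);
        Console.WriteLine(c.X + " " + c.Y + " " + c.Z);
    }
}
```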

Edit Again

For the vector add example where an array of vectors is summed together the struct version keeps the accumulator in 3 registers

 var accStruct = new Point3Struct(0, 0, 0);
 for (int i = 0; i < n; i++)
     accStruct = Point3Struct.Add(accStruct, pointStruct[(i + 1) % m]);

the asm body is

// load the next vector into a register
00007FFA3CA2240E  vmovsd      xmm3,qword ptr [rax]  
00007FFA3CA22413  vmovsd      xmm4,qword ptr [rax+8]  
00007FFA3CA22419  vmovsd      xmm5,qword ptr [rax+10h]  
// Sum the accumulator (the accumulator stays in the registers )
00007FFA3CA2241F  vaddsd      xmm0,xmm0,xmm3  
00007FFA3CA22424  vaddsd      xmm1,xmm1,xmm4  
00007FFA3CA22429  vaddsd      xmm2,xmm2,xmm5

but the class-based version reads the accumulator from memory and writes it back out on every iteration, which is inefficient

var accPC = new Point3Class(0, 0, 0);
for (int i = 0; i < n; i++)
    accPC = Point3Class.Add(accPC, pointClass[(i + 1) % m]);

the asm body is

// Read and add both accumulator X and Xnext from main memory
00007FFA3CA2224A  vmovsd      xmm0,qword ptr [r14+8]     
00007FFA3CA22250  vmovaps     xmm7,xmm0                   
00007FFA3CA22255  vaddsd      xmm7,xmm7,mmword ptr [r12+8]  


// Read and add both accumulator Y and Ynext from main memory
00007FFA3CA2225C  vmovsd      xmm0,qword ptr [r14+10h]  
00007FFA3CA22262  vmovaps     xmm8,xmm0  
00007FFA3CA22267  vaddsd      xmm8,xmm8,mmword ptr [r12+10h] 

// Read and add both accumulator Z and Znext from main memory
00007FFA3CA2226E  vmovsd      xmm9,qword ptr [r14+18h]  
00007FFA3CA22283  vmovaps     xmm0,xmm9  
00007FFA3CA22288  vaddsd      xmm0,xmm0,mmword ptr [r12+18h]

// Move accumulator X,Y,Z back to main memory.
00007FFA3CA2228F  vmovsd      qword ptr [rax+8],xmm7  
00007FFA3CA22295  vmovsd      qword ptr [rax+10h],xmm8  
00007FFA3CA2229B  vmovsd      qword ptr [rax+18h],xmm0

Answer 1

Update

After spending some time thinking about the problem, I think I agree with @DavidHaim that memory jump overhead is not the issue here, because of caching.

I've also added more options to your tests (and removed the first one, with inheritance). So I have:

cl = variable of class with 3 points:
- Dot(cl, cl) - initial method
- Dot(cl) - which is "square product"
- Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z) aka Dot(cl.xyz)- pass fields
st = variable of struct with 3 points:
- Dot(st, st) - initial
- Dot(st) - square product
- Dot(st.X, st.Y, st.Z, st.X, st.Y, st.Z) aka Dot(st.xyz) - pass fields
st6 = variable of struct with 6 points:
- Dot(st6) - wanted to check if size of struct matters
Dot(x, y, z, x, y, z) aka Dot(xyz) - just local const double variables.
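To make the variants concrete, here is a rough sketch of the three class overloads; the struct versions are analogous. The type name Point3Class and the exact signatures are illustrative assumptions, not a copy of the linked test code.

```csharp
using System;

// Illustrative class with three coordinates (assumed shape).
public class Point3Class
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3Class(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    // Dot(cl, cl): the initial method, taking two instances.
    public static double Dot(Point3Class a, Point3Class b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }

    // Dot(cl): the "square product" of a single instance.
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }

    // Dot(cl.xyz): the coordinates are passed as plain doubles.
    public static double Dot(double x1, double y1, double z1,
                             double x2, double y2, double z2)
    {
        return x1 * x2 + y1 * y2 + z1 * z2;
    }
}

public static class VariantsDemo
{
    public static void Main()
    {
        var cl = new Point3Class(11, 12, 13);
        // All three calls compute the same value for the same instance.
        Console.WriteLine(Point3Class.Dot(cl, cl));
        Console.WriteLine(Point3Class.Dot(cl));
        Console.WriteLine(Point3Class.Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z));
    }
}
```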

Result times are:

Dot(cl.xyz) is the worst ~570ms,
Dot(st6), Dot(st.xyz) is the second worst ~440ms and ~480ms
the others are ~325ms

...and I'm not really sure why I see these results.

Maybe for plain primitive types the compiler does more aggressive pass-by-register optimizations; maybe it is more sure of lifetime boundaries or constness and can then optimize more aggressively. Maybe some kind of loop unrolling.

I think my expertise is just not enough :) But still, my results contradict yours.

You can find the full test code, with results on my machine and the generated IL code, here.

In C#, classes are reference types and structs are value types. One major effect is that value types can be (and most of the time are!) allocated on the stack, while reference types are always allocated on the heap.

So every time you get access to the inner state of a reference type variable you need to dereference the pointer to memory in the heap (it's a kind of jump), while for value types it's already on the stack or even optimized out to registers.

I think you see a difference because of this.
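The value/reference distinction is visible from code even without looking at the memory layout. Here is a small sketch with made-up types (PointValue and PointRef are illustrative names). It shows the copy semantics only and says nothing about timing.

```csharp
using System;

// Illustrative value type: assignment copies the data.
public struct PointValue
{
    public double X;
}

// Illustrative reference type: assignment copies the reference.
public class PointRef
{
    public double X;
}

public static class CopySemanticsDemo
{
    // Mutating a copy of a struct leaves the original untouched.
    public static double StructOriginalAfterCopyEdit()
    {
        var a = new PointValue { X = 1 };
        var b = a;   // b is an independent copy of a
        b.X = 2;
        return a.X;  // still 1
    }

    // Mutating through a second reference changes the one shared object.
    public static double ClassOriginalAfterCopyEdit()
    {
        var a = new PointRef { X = 1 };
        var b = a;   // b refers to the same object as a
        b.X = 2;
        return a.X;  // now 2
    }

    public static void Main()
    {
        Console.WriteLine(StructOriginalAfterCopyEdit());
        Console.WriteLine(ClassOriginalAfterCopyEdit());
    }
}
```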

PS: with "most of the time are" I was alluding to boxing, which is the exception: it is the mechanism that places value type objects on the heap (e.g. when casting a value type to an interface or for dynamic method call binding).
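A short sketch of boxing, using a made-up struct and interface (Vec2 and IHasLength are illustrative names): converting the struct to object or to an interface type copies it into a heap object, and later changes to the original do not affect the boxed copy.

```csharp
using System;

// Illustrative interface implemented by a struct.
public interface IHasLength
{
    double Length();
}

// Illustrative value type.
public struct Vec2 : IHasLength
{
    public double X;
    public double Y;

    public double Length()
    {
        return Math.Sqrt(X * X + Y * Y);
    }
}

public static class BoxingDemo
{
    // Boxing copies the struct into a heap object.
    public static double BoxedXAfterOriginalChanges()
    {
        var v = new Vec2 { X = 3, Y = 4 };
        object boxed = v;          // boxing: a copy of v now lives on the heap
        v.X = 0;                   // changes only the local, not the boxed copy
        var unboxed = (Vec2)boxed; // unboxing copies the value back out
        return unboxed.X;
    }

    // Converting a struct to an interface type also boxes it.
    public static double LengthViaInterface()
    {
        var v = new Vec2 { X = 3, Y = 4 };
        IHasLength asInterface = v;
        return asInterface.Length();
    }

    public static void Main()
    {
        Console.WriteLine(BoxedXAfterOriginalChanges());
        Console.WriteLine(LengthViaInterface());
    }
}
```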

Answer 2

As I thought, this test doesn't prove much.

TL;DR: the compiler completely optimizes away the call to Point3ControlC.Dot while preserving the calls to the other two. The difference is not because structs are faster in this case, but because the entire calculation is skipped.

My settings:

Visual Studio 2015 Update 3
.NET Framework version 4.6.1
Release mode, Any CPU (my CPU is 64 bit)
Windows 10
CPU: Processor Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz, 2295 Mhz, 2 Core(s), 4 Logical Processor(s)

The generated assembly for

for (int i = 0; i < n; i++)
{
    acc += Point3ControlA.Dot(vControlA, vControlA);
}

is:

00DC0573  xor         edx,edx  // temp = 0
00DC0575  mov         dword ptr [ebp-10h],edx // i = temp
00DC0578  mov         ecx,edi  // load vControlA as the first argument
00DC057A  mov         edx,edi  // load vControlA as the second argument
00DC057C  call        dword ptr ds:[0BA4F0Ch] // call Point3ControlA.Dot
00DC0582  fstp        st(0)  // pop the result off the FPU stack, discarding it
00DC0584  inc         dword ptr [ebp-10h]  // i++
00DC0587  cmp         dword ptr [ebp-10h],989680h // compare i with n (989680h = 10000000)
00DC058E  jl          00DC0578  // if i < n, jump to the beginning of the loop

Afterthoughts:
The JIT compiler for some reason did not use a register for i, so it incremented an integer on the stack (ebp-10h) instead. As a result, this test has the poorest performance.

Moving on to the second test:

for (int i = 0; i < n; i++)
{
    acc += Point3ControlB.Dot(vControlB, vControlB);
}

Generated assembly:

00DC0612  xor         edi,edi  // i = 0
00DC0614  mov         ecx,esi  // load vControlB as the first argument
00DC0616  mov         edx,esi  // load vControlB as the second argument
00DC0618  call        dword ptr ds:[0BA4FD4h] // call Point3ControlB.Dot
00DC061E  fstp        st(0)  // pop the result off the FPU stack, discarding it
00DC0620  inc         edi  // ++i
00DC0621  cmp         edi,989680h // compare i with n
00DC0627  jl          00DC0614  // if i < n, jump to the beginning of the loop

Afterthoughts: this generated assembly is almost identical to the first one, but this time the JIT did use a register for i, hence the minor performance boost over the first test.

Moving on to the test in question:

for (int i = 0; i < n; i++)
{
    acc += Point3ControlC.Dot(vControlC, vControlC);
}

And for the generated assembly:

00DC06A7  xor         eax,eax  // i = 0
00DC06A9  inc         eax  // ++i
00DC06AA  cmp         eax,989680h // compare i with n
00DC06AF  jl          00DC06A9  // if i < n, jump to the beginning of the loop

As we can see, the JIT has completely optimized away the call to Point3ControlC.Dot, so you only pay for the loop and not for the call itself. Hence this "test" finishes first, as it didn't do much to begin with.

Can we say something about structs vs classes from this test alone? Well, no. I'm still not quite sure why the compiler decided to optimize out the call to the struct function while preserving the other calls. What I am sure about is that in real-life code the compiler cannot optimize the call away if the result is used. In this mini-benchmark we don't do much with the result, and even if we did, the compiler could calculate the result at compile time. So the compiler can be more aggressive here than it could be in real-life code.
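One way to reduce this effect is to make the result observable, for example by returning and printing the accumulator. Below is a minimal sketch of that idea for the struct case. The UsedResultBenchmark class and its RunStruct helper are names made up for illustration, and I have not checked the generated assembly for this version. It only addresses the discarded result: since the inputs never change, a JIT could still hoist or fold the computation.

```csharp
using System;
using System.Diagnostics;

// Same shape as the struct from the question.
public struct Point3ControlC
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3ControlC(double x, double y, double z) : this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlC a, Point3ControlC b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

// Hypothetical harness: the accumulated value is returned so the caller can use it.
public static class UsedResultBenchmark
{
    public static double RunStruct(int n)
    {
        var v = new Point3ControlC(11, 12, 13);
        double acc = 0;
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlC.Dot(v, v);
        }
        return acc;
    }

    public static void Main()
    {
        const int n = 10000000;
        RunStruct(n); // warm-up call so JIT compilation is not part of the timing

        var sw = Stopwatch.StartNew();
        double acc = RunStruct(n);
        sw.Stop();

        // Printing acc means the loop's result is actually used.
        Console.WriteLine("ControlC " + sw.ElapsedMilliseconds + " ms, acc = " + acc);
    }
}
```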

2 answers

solution1
4 2017-07-06 13:00:41

solution2
1 2017-07-08 09:06:08
