Why are structs so much faster than classes for this specific case?

I have three cases to test the relative performance of classes, classes with inheritance, and structs. They are meant for tight loops, so performance counts. Dot products are used as part of many algorithms in 2D and 3D geometry, and I have run the profiler on real code. The tests below are indicative of real-world performance problems I have seen.

The results for 100000000 iterations of the loop applying the dot product are:

ControlA 208 ms   ( class with inheritance )
ControlB 201 ms   ( class with no inheritance )
ControlC 85  ms   ( struct )

The tests were run without the debugger attached and with optimizations turned on. My question is: what is it about classes in this case that makes them so slow?

I presumed the JIT would still be able to inline all the calls, class or struct, so in effect the results should be identical. Note that if I disable optimizations, my results are identical:

ControlA 3239
ControlB 3228
ControlC 3213

They are always within 20ms of each other if the test is re-run.

The classes under investigation

using System;
using System.Diagnostics;

public class PointControlA
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public PointControlA(double x, double y)
    {
        X = x;
        Y = y;
    }
}

public class Point3ControlA : PointControlA
{
    public double Z
    {
        get;
        set;
    }

    public Point3ControlA(double x, double y, double z): base (x, y)
    {
        Z = z;
    }

    public static double Dot(Point3ControlA a, Point3ControlA b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public class Point3ControlB
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public double Z
    {
        get;
        set;
    }

    public Point3ControlB(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlB a, Point3ControlB b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public struct Point3ControlC
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public double Z
    {
        get;
        set;
    }

    public Point3ControlC(double x, double y, double z):this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlC a, Point3ControlC b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

Test Script

public class Program
{
    public static void TestStructClass()
    {
        var vControlA = new Point3ControlA(11, 12, 13);
        var vControlB = new Point3ControlB(11, 12, 13);
        var vControlC = new Point3ControlC(11, 12, 13);
        var sw = Stopwatch.StartNew();
        var n = 10000000;
        double acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlA.Dot(vControlA, vControlA);
        }

        Console.WriteLine("ControlA " + sw.ElapsedMilliseconds);
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlB.Dot(vControlB, vControlB);
        }

        Console.WriteLine("ControlB " + sw.ElapsedMilliseconds);
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlC.Dot(vControlC, vControlC);
        }

        Console.WriteLine("ControlC " + sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        TestStructClass();
    }
}

This dotnet fiddle is proof of compilation only. It does not show the performance differences.

I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea. I now have the test case to prove it, but I can't understand why.

NOTE: I have tried to set a breakpoint in the debugger with JIT optimizations turned on, but the debugger will not break. Looking at the IL with JIT optimizations turned off doesn't tell me anything.

EDIT

After the answer by @pkuderov I took his code and played with it. I changed the code and found that if I forced inlining via

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }

the difference between the struct and the class for the dot product vanished. Why the attribute is not needed in some setups but was needed in mine is not clear. However, I did not give up. There is still a performance problem with the vendor code, and I think the DotProduct is not the best example.
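For anyone who wants to try the attribute in isolation, here is a minimal self-contained sketch. The Point3Class layout is assumed from the snippet above, and InliningDemo is a made-up name for illustration. Note that AggressiveInlining is only a request to the JIT, not a guarantee.

```csharp
using System;
using System.Runtime.CompilerServices;

// Assumed shape of Point3Class; the code that was actually measured may differ.
public class Point3Class
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3Class(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    // Asks the JIT to inline this method at its call sites where it can.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }
}

// Hypothetical driver, only here to make the sketch runnable.
public static class InliningDemo
{
    public static void Main()
    {
        var p = new Point3Class(11, 12, 13);
        Console.WriteLine(Point3Class.Dot(p)); // 121 + 144 + 169 = 434
    }
}
```

Whether the call is really inlined can only be confirmed by looking at the generated assembly for an optimized build.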

I modified @pkuderov's code to implement Vector Add, which creates new instances of the structs and classes. The results are here:

https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48

In the example I also modified the code to pick a pseudo-random vector from an array, to avoid the problem of the instances staying in registers (I hope).

The results show that:

DotProduct performance is identical, or maybe faster, for classes.
Vector Add, and I assume anything that creates a new object, is slower for classes.

Add class/class 2777ms
Add struct/struct 2457ms

DotProd class/class 1909ms
DotProd struct/struct 2108ms

The full code and results are here if anybody wants to try it out.

Edit Again

For the vector add example, where an array of vectors is summed together, the struct version keeps the accumulator in 3 registers:

 var accStruct = new Point3Struct(0, 0, 0);
 for (int i = 0; i < n; i++)
     accStruct = Point3Struct.Add(accStruct, pointStruct[(i + 1) % m]);

the asm body is

// load the next vector into a register
00007FFA3CA2240E  vmovsd      xmm3,qword ptr [rax]  
00007FFA3CA22413  vmovsd      xmm4,qword ptr [rax+8]  
00007FFA3CA22419  vmovsd      xmm5,qword ptr [rax+10h]  
// Sum the accumulator (the accumulator stays in the registers )
00007FFA3CA2241F  vaddsd      xmm0,xmm0,xmm3  
00007FFA3CA22424  vaddsd      xmm1,xmm1,xmm4  
00007FFA3CA22429  vaddsd      xmm2,xmm2,xmm5  

but the class-based vector version reads the accumulator from main memory and writes it back on every iteration, which is inefficient:

var accPC = new Point3Class(0, 0, 0);
for (int i = 0; i < n; i++)
    accPC = Point3Class.Add(accPC, pointClass[(i + 1) % m]);

the asm body is

// Read and add both accumulator X and Xnext from main memory
00007FFA3CA2224A  vmovsd      xmm0,qword ptr [r14+8]     
00007FFA3CA22250  vmovaps     xmm7,xmm0                   
00007FFA3CA22255  vaddsd      xmm7,xmm7,mmword ptr [r12+8]  


// Read and add both accumulator Y and Ynext from main memory
00007FFA3CA2225C  vmovsd      xmm0,qword ptr [r14+10h]  
00007FFA3CA22262  vmovaps     xmm8,xmm0  
00007FFA3CA22267  vaddsd      xmm8,xmm8,mmword ptr [r12+10h] 

// Read and add both accumulator Z and Znext from main memory
00007FFA3CA2226E  vmovsd      xmm9,qword ptr [r14+18h]  
00007FFA3CA22283  vmovaps     xmm0,xmm9  
00007FFA3CA22288  vaddsd      xmm0,xmm0,mmword ptr [r12+18h]

// Move accumulator X,Y,Z back to main memory.
00007FFA3CA2228F  vmovsd      qword ptr [rax+8],xmm7  
00007FFA3CA22295  vmovsd      qword ptr [rax+10h],xmm8  
00007FFA3CA2229B  vmovsd      qword ptr [rax+18h],xmm0  
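To make the two loops above reproducible on their own, here is a rough sketch of what Point3Struct, Point3Class and their Add methods could look like. These definitions are assumptions (public fields, Add returning a new instance); the gist linked earlier has the code that was actually measured. AddDemo, SumStructs and SumClasses are made-up names.

```csharp
using System;

// Assumed definitions, written only to illustrate the two accumulator loops.
public struct Point3Struct
{
    public double X, Y, Z;

    public Point3Struct(double x, double y, double z)
    {
        X = x; Y = y; Z = z;
    }

    // Returns the sum by value; no heap allocation is involved.
    public static Point3Struct Add(Point3Struct a, Point3Struct b)
    {
        return new Point3Struct(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

public class Point3Class
{
    public double X, Y, Z;

    public Point3Class(double x, double y, double z)
    {
        X = x; Y = y; Z = z;
    }

    // Returns a newly allocated object for every sum.
    public static Point3Class Add(Point3Class a, Point3Class b)
    {
        return new Point3Class(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

public static class AddDemo
{
    public static Point3Struct SumStructs(Point3Struct[] points, int n)
    {
        int m = points.Length;
        var acc = new Point3Struct(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Struct.Add(acc, points[(i + 1) % m]);
        return acc;
    }

    public static Point3Class SumClasses(Point3Class[] points, int n)
    {
        int m = points.Length;
        var acc = new Point3Class(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Class.Add(acc, points[(i + 1) % m]);
        return acc;
    }

    public static void Main()
    {
        var ps = new[] { new Point3Struct(1, 2, 3), new Point3Struct(4, 5, 6) };
        var pc = new[] { new Point3Class(1, 2, 3), new Point3Class(4, 5, 6) };
        Point3Struct s = SumStructs(ps, 4);
        Point3Class c = SumClasses(pc, 4);
        Console.WriteLine(s.X + " " + s.Y + " " + s.Z); // 10 14 18
        Console.WriteLine(c.X + " " + c.Y + " " + c.Z); // 10 14 18
    }
}
```

Under these assumptions each class Add produces a fresh heap object, so the running sum has to live in memory, which would be consistent with the loads and stores in the class listing above.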

Update

After spending some time thinking about the problem, I think I agree with @DavidHaim that memory jump overhead is not the issue here, because of caching.

I've also added more options to your tests (and removed the first one, with inheritance). So I have:

  • cl = variable of class with 3 points:
    • Dot(cl, cl) - initial method
    • Dot(cl) - which is the "square product"
    • Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z) aka Dot(cl.xyz) - pass fields
  • st = variable of struct with 3 points:
    • Dot(st, st) - initial
    • Dot(st) - square product
    • Dot(st.X, st.Y, st.Z, st.X, st.Y, st.Z) aka Dot(st.xyz) - pass fields
  • st6 = variable of struct with 6 points:
    • Dot(st6) - wanted to check if the size of the struct matters
  • Dot(x, y, z, x, y, z) aka Dot(xyz) - just local const double variables.
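The class-side variants in the list above might look roughly like this. The overload shapes are guesses based on the names in the list, not the exact test code, and VariantsDemo is a made-up driver.

```csharp
using System;

// Assumed class layout; only the three call shapes matter here.
public class Point3Class
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3Class(double x, double y, double z)
    {
        X = x; Y = y; Z = z;
    }

    // Dot(cl, cl): two references are passed.
    public static double Dot(Point3Class a, Point3Class b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }

    // Dot(cl): "square product" of a single argument.
    public static double Dot(Point3Class a)
    {
        return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
    }

    // Dot(cl.xyz): the caller reads the fields and passes six doubles.
    public static double Dot(double ax, double ay, double az,
                             double bx, double by, double bz)
    {
        return ax * bx + ay * by + az * bz;
    }
}

public static class VariantsDemo
{
    public static void Main()
    {
        var cl = new Point3Class(11, 12, 13);
        Console.WriteLine(Point3Class.Dot(cl, cl));                               // 434
        Console.WriteLine(Point3Class.Dot(cl));                                   // 434
        Console.WriteLine(Point3Class.Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z));   // 434
    }
}
```

All three compute the same value; they differ only in how the operands reach the method, which is the thing the timings compare.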

Result times are:

  • Dot(cl.xyz) is the worst, ~570ms,
  • Dot(st6) and Dot(st.xyz) are the second worst, ~440ms and ~480ms,
  • the others are ~325ms.

...And I'm not really sure why I see these results.

Maybe for plain primitive types the compiler does more aggressive pass-by-register optimizations; maybe it is more sure of lifetime boundaries or constness and can then optimize more aggressively again. Maybe some kind of loop unwinding.

I think my expertise is just not enough :) But still, my results contradict yours.

Full test code, with results on my machine and the generated IL code, you can find here.


In C#, classes are reference types and structs are value types. One major effect is that value types can be (and most of the time are!) allocated on the stack, while reference types are always allocated on the heap.

So every time you access the inner state of a reference type variable, you need to dereference the pointer to memory in the heap (it's a kind of jump), while for value types the data is already on the stack or even optimized out to registers.

I think you see a difference because of this.

P.S. By "most of the time" I was alluding to boxing: a technique used to place value type objects on the heap (e.g. to cast value types to an interface, or for dynamic method call binding).
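A small sketch of boxing as described above. IHasLength, Vec2 and BoxingDemo are made-up names for illustration. Assigning the struct to an interface variable copies it into a new heap object, so later changes to the original do not affect the boxed copy.

```csharp
using System;

// Hypothetical interface, used only to trigger boxing.
public interface IHasLength
{
    double Length();
}

public struct Vec2 : IHasLength
{
    public double X, Y;

    public Vec2(double x, double y)
    {
        X = x; Y = y;
    }

    public double Length()
    {
        return Math.Sqrt(X * X + Y * Y);
    }
}

public static class BoxingDemo
{
    public static void Main()
    {
        Vec2 v = new Vec2(3, 4);   // plain value, no heap object
        IHasLength boxed = v;      // boxing: v is copied into a heap object
        v.X = 0;                   // modifies only the local value

        Console.WriteLine(v.Length());     // 4
        Console.WriteLine(boxed.Length()); // 5
    }
}
```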

As I thought, this test doesn't prove much.

TLDR: the compiler completely optimizes away the call to Point3ControlC.Dot while preserving the calls to the other two. The difference is not because structs are faster in this case, but because you skip the entire calculation part.

My settings:

  • Visual Studio 2015 Update 3
  • .NET Framework version 4.6.1
  • Release mode, Any CPU (my CPU is 64 bit)
  • Windows 10
  • CPU: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz, 2295 MHz, 2 Core(s), 4 Logical Processor(s)

The generated assembly for

for (int i = 0; i < n; i++)
{
    acc += Point3ControlA.Dot(vControlA, vControlA);
}

is:

00DC0573  xor         edx,edx  // temp = 0
00DC0575  mov         dword ptr [ebp-10h],edx // i = temp
00DC0578  mov         ecx,edi  // load vControlA as the first argument
00DC057A  mov         edx,edi  // load vControlA as the second argument
00DC057C  call        dword ptr ds:[0BA4F0Ch] // call Point3ControlA.Dot
00DC0582  fstp        st(0)  // pop the result off the FPU stack (it is discarded)
00DC0584  inc         dword ptr [ebp-10h]  // i++
00DC0587  cmp         dword ptr [ebp-10h],989680h // compare i with n (10000000)
00DC058E  jl          00DC0578  // if i < n, jump to the beginning of the loop

After thoughts:
The JIT compiler for some reason did not use a register for i, so it incremented an integer on the stack (ebp-10h) instead. As a result, this test has the poorest performance.

Moving on to the second test:

for (int i = 0; i < n; i++)
{
    acc += Point3ControlB.Dot(vControlB, vControlB);
}

Generated assembly:

00DC0612  xor         edi,edi  // i = 0
00DC0614  mov         ecx,esi  // load vControlB as the first argument
00DC0616  mov         edx,esi  // load vControlB as the second argument
00DC0618  call        dword ptr ds:[0BA4FD4h] // call Point3ControlB.Dot
00DC061E  fstp        st(0)  // pop the result off the FPU stack (it is discarded)
00DC0620  inc         edi  // ++i
00DC0621  cmp         edi,989680h // compare i with n (10000000)
00DC0627  jl          00DC0614  // if i < n, jump to the beginning of the loop

After thoughts: this generated assembly is almost identical to the first one, but this time the JIT did use a register for i, hence the minor performance boost over the first test.

Moving on to the test in question:

for (int i = 0; i < n; i++)
{
    acc += Point3ControlC.Dot(vControlC, vControlC);
}

And the generated assembly:

00DC06A7  xor         eax,eax  // i = 0
00DC06A9  inc         eax  // ++i
00DC06AA  cmp         eax,989680h // compare i with n (10000000)
00DC06AF  jl          00DC06A9  // if i < n, jump to the beginning of the loop

As we can see, the JIT has completely optimized away the call to Point3ControlC.Dot, so you actually pay only for the loop and not for the call itself. Hence this "test" finishes first, as it didn't do much to begin with.

Can we say something about structs vs classes from this test alone? Well, no. I'm still not quite sure why the compiler decided to optimize out the call to the struct function while preserving the other calls. What I am sure about is that in real-life code the compiler cannot optimize the call away if the result is used. In this mini-benchmark we don't do much with the result, and even if we did, the compiler could calculate the result at compile time. So the compiler can be more aggressive here than it could be in real-life code.
