Why are structs so much faster than classes for this specific case?
I have three cases to test the relative performance of classes, classes with inheritance, and structs. These are to be used in tight loops, so performance counts. Dot products are used as part of many algorithms in 2D and 3D geometry, and I have run the profiler on real code. The tests below are indicative of real-world performance problems I have seen.
The results for 100000000 iterations of the loop applying the dot product are:
ControlA 208 ms ( class with inheritance )
ControlB 201 ms ( class with no inheritance )
ControlC 85 ms ( struct )
The tests were run without the debugger attached and with optimizations turned on. My question is, what is it about classes in this case that causes them to be so slow?
I presumed the JIT would still be able to inline all the calls, class or struct, so in effect the results should be identical. Note that if I disable optimizations, my results are identical:
ControlA 3239
ControlB 3228
ControlC 3213
They are always within 20ms of each other if the test is re-run.
using System;
using System.Diagnostics;

// Case A: class with inheritance.
public class PointControlA
{
    public double X { get; set; }
    public double Y { get; set; }

    public PointControlA(double x, double y)
    {
        X = x;
        Y = y;
    }
}

public class Point3ControlA : PointControlA
{
    public double Z { get; set; }

    public Point3ControlA(double x, double y, double z) : base(x, y)
    {
        Z = z;
    }

    public static double Dot(Point3ControlA a, Point3ControlA b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

// Case B: class with no inheritance.
public class Point3ControlB
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3ControlB(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlB a, Point3ControlB b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

// Case C: struct.
public struct Point3ControlC
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Z { get; set; }

    public Point3ControlC(double x, double y, double z) : this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlC a, Point3ControlC b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public class Program
{
    public static void TestStructClass()
    {
        var vControlA = new Point3ControlA(11, 12, 13);
        var vControlB = new Point3ControlB(11, 12, 13);
        var vControlC = new Point3ControlC(11, 12, 13);
        var sw = Stopwatch.StartNew();
        var n = 10000000;

        double acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlA.Dot(vControlA, vControlA);
        }
        Console.WriteLine("ControlA " + sw.ElapsedMilliseconds);

        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlB.Dot(vControlB, vControlB);
        }
        Console.WriteLine("ControlB " + sw.ElapsedMilliseconds);

        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlC.Dot(vControlC, vControlC);
        }
        Console.WriteLine("ControlC " + sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        TestStructClass();
    }
}
This dotnet fiddle is proof of compilation only. It does not show the performance differences.
I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea. I now have the test case to prove it, but I can't understand why.
NOTE: I have tried to set a breakpoint in the debugger with JIT optimizations turned on, but the debugger will not break. Looking at the IL with JIT optimizations turned off doesn't tell me anything.
After the answer by @pkuderov I took his code and played with it. I changed the code and found that if I forced inlining via
// MethodImpl lives in System.Runtime.CompilerServices
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double Dot(Point3Class a)
{
    return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
}
the difference between the struct and the class for the dot product vanished. Why the attribute is not needed with some setups but was needed for me is not clear. However, I did not give up. There is still a performance problem with the vendor code, and I think the DotProduct is not the best example.
I modified @pkuderov's code to implement Vector Add, which will create new instances of the structs and classes. The results are here:
https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48
In the example I also modified the code to pick a pseudo-random vector from an array to avoid the problem of the instances sticking in the registers (I hope).
The results show that:
DotProduct performance is identical or maybe faster for classes.
Vector Add, and I assume anything that creates a new object, is slower for classes.
Add class/class 2777ms
Add struct/struct 2457ms
DotProd class/class 1909ms
DotProd struct/struct 2108ms
The full code and results are here if anybody wants to try it out.
For the vector add example, where an array of vectors is summed together, the struct version keeps the accumulator in 3 registers:
var accStruct = new Point3Struct(0, 0, 0);
for (int i = 0; i < n; i++)
    accStruct = Point3Struct.Add(accStruct, pointStruct[(i + 1) % m]);
The asm body is:
// load the next vector into a register
00007FFA3CA2240E vmovsd xmm3,qword ptr [rax]
00007FFA3CA22413 vmovsd xmm4,qword ptr [rax+8]
00007FFA3CA22419 vmovsd xmm5,qword ptr [rax+10h]
// Sum the accumulator (the accumulator stays in the registers )
00007FFA3CA2241F vaddsd xmm0,xmm0,xmm3
00007FFA3CA22424 vaddsd xmm1,xmm1,xmm4
00007FFA3CA22429 vaddsd xmm2,xmm2,xmm5
But the class-based vector version reads the accumulator from main memory and writes it back on every iteration, which is inefficient:
var accPC = new Point3Class(0, 0, 0);
for (int i = 0; i < n; i++)
    accPC = Point3Class.Add(accPC, pointClass[(i + 1) % m]);
The asm body is:
// Read and add both accumulator X and Xnext from main memory
00007FFA3CA2224A vmovsd xmm0,qword ptr [r14+8]
00007FFA3CA22250 vmovaps xmm7,xmm0
00007FFA3CA22255 vaddsd xmm7,xmm7,mmword ptr [r12+8]
// Read and add both accumulator Y and Ynext from main memory
00007FFA3CA2225C vmovsd xmm0,qword ptr [r14+10h]
00007FFA3CA22262 vmovaps xmm8,xmm0
00007FFA3CA22267 vaddsd xmm8,xmm8,mmword ptr [r12+10h]
// Read and add both accumulator Z and Znext from main memory
00007FFA3CA2226E vmovsd xmm9,qword ptr [r14+18h]
00007FFA3CA22283 vmovaps xmm0,xmm9
00007FFA3CA22288 vaddsd xmm0,xmm0,mmword ptr [r12+18h]
// Move accumulator X,Y,Z back to main memory.
00007FFA3CA2228F vmovsd qword ptr [rax+8],xmm7
00007FFA3CA22295 vmovsd qword ptr [rax+10h],xmm8
00007FFA3CA2229B vmovsd qword ptr [rax+18h],xmm0
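The two loops compared above can be sketched roughly as follows. This is only an illustration: the shapes of `Point3Struct` and `Point3Class` are assumptions (the real definitions are in the linked gist and may differ), and `AddDemo`, `SumStructs` and `SumClasses` are hypothetical names. Whether the JIT actually keeps the struct accumulator in registers depends on the runtime and its version.

```csharp
using System;

// Assumed shapes for illustration; the real definitions are in the linked gist and may differ.
public struct Point3Struct
{
    public double X, Y, Z;
    public Point3Struct(double x, double y, double z) { X = x; Y = y; Z = z; }

    // Returns a new value; no heap allocation is required.
    public static Point3Struct Add(Point3Struct a, Point3Struct b)
    {
        return new Point3Struct(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

public class Point3Class
{
    public double X, Y, Z;
    public Point3Class(double x, double y, double z) { X = x; Y = y; Z = z; }

    // Allocates a new heap object on every call.
    public static Point3Class Add(Point3Class a, Point3Class b)
    {
        return new Point3Class(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }
}

// Hypothetical helper class, not part of the original code.
public static class AddDemo
{
    public static Point3Struct SumStructs(Point3Struct[] points, int n)
    {
        var acc = new Point3Struct(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Struct.Add(acc, points[(i + 1) % points.Length]);
        return acc;
    }

    public static Point3Class SumClasses(Point3Class[] points, int n)
    {
        var acc = new Point3Class(0, 0, 0);
        for (int i = 0; i < n; i++)
            acc = Point3Class.Add(acc, points[(i + 1) % points.Length]);
        return acc;
    }

    public static void Main()
    {
        var s = new[] { new Point3Struct(1, 2, 3), new Point3Struct(4, 5, 6) };
        var c = new[] { new Point3Class(1, 2, 3), new Point3Class(4, 5, 6) };
        Point3Struct a = SumStructs(s, 4);
        Point3Class b = SumClasses(c, 4);
        // Both versions compute the same sums; only the storage of the accumulator differs.
        Console.WriteLine(a.X + " " + a.Y + " " + a.Z);
        Console.WriteLine(b.X + " " + b.Y + " " + b.Z);
    }
}
```

In the class version every iteration produces a fresh heap object that becomes the new accumulator, which is consistent with the loads and stores through `r14` and `rax` in the listing above.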
Update
After spending some time thinking about the problem, I think I agree with @DavidHaim that memory jump overhead is not the issue here, because of caching.
Also, I've added more options to your tests (and removed the first one, with inheritance). So I have:
Dot(cl, cl) - initial method
Dot(cl) - which is "square product"
Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z) aka Dot(cl.xyz) - pass fields
Dot(st, st) - initial
Dot(st) - square product
Dot(st.X, st.Y, st.Z, st.X, st.Y, st.Z) aka Dot(st.xyz) - pass fields
Dot(st6) - wanted to check if size of struct matters
Dot(x, y, z, x, y, z) aka Dot(xyz) - just local const double variables.
Result times are:
...And I'm not really sure why I see these results.
Maybe for plain primitive types the compiler does more aggressive pass-by-register optimizations; maybe it is more certain of lifetime boundaries or constness and can then optimize more aggressively still. Maybe some kind of loop unrolling.
I think my expertise is just not enough :) But still, my results counter your results.
The full test code, with results on my machine and the generated IL code, can be found here.
In C#, classes are reference types and structs are value types. One major effect is that value types can be (and most of the time are!) allocated on the stack, while reference types are always allocated on the heap.
So every time you access the inner state of a reference type variable, you need to dereference the pointer to memory in the heap (it's a kind of jump), while for value types the state is already on the stack or even optimized out to registers.
I think you see a difference because of this.
P.S. By "most of the time are" I meant boxing; it's a technique used to place value type objects on the heap (e.g. to cast value types to an interface or for dynamic method call binding).
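As a small illustration of the boxing mentioned in the P.S., here is a minimal sketch. `IHasLength`, `Vec2` and `BoxingDemo` are hypothetical names introduced only for this example. It shows that converting a struct to an interface type produces a separate boxed copy on the heap.

```csharp
using System;

// Hypothetical interface and struct, used only for this illustration.
public interface IHasLength
{
    double Length();
}

public struct Vec2 : IHasLength
{
    public double X, Y;
    public Vec2(double x, double y) { X = x; Y = y; }
    public double Length() { return Math.Sqrt(X * X + Y * Y); }
}

public static class BoxingDemo
{
    // Each conversion of a struct to an interface type boxes a fresh copy,
    // so the two references point at different heap objects.
    public static bool TwoBoxesAreSameObject(Vec2 v)
    {
        IHasLength a = v; // box #1
        IHasLength b = v; // box #2
        return object.ReferenceEquals(a, b);
    }

    // Boxes v, then modifies the local value and reads the length through the box.
    public static double LengthOfBoxedCopy(Vec2 v)
    {
        IHasLength boxed = v; // boxed copy of v
        v.X = 0;              // changes only the local value, not the boxed copy
        return boxed.Length();
    }

    public static void Main()
    {
        var v = new Vec2(3, 4);
        Console.WriteLine(TwoBoxesAreSameObject(v));
        Console.WriteLine(LengthOfBoxedCopy(v));
    }
}
```

None of the `Dot` methods in the question take an interface or `object` parameter, so no boxing should occur in the benchmark itself; the sketch only shows when a value type does end up on the heap.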
As I thought, this test doesn't prove much.
TLDR: the compiler completely optimizes away the call to Point3ControlC.Dot while preserving the calls to the other two. The difference is not because structs are faster in this case, but because you skip the entire calculation part.
My settings:
The generated assembly for
for (int i = 0; i < n; i++)
{
    acc += Point3ControlA.Dot(vControlA, vControlA);
}
is:
00DC0573 xor edx,edx // temp = 0
00DC0575 mov dword ptr [ebp-10h],edx // i = temp
00DC0578 mov ecx,edi // load vControlA as first parameter
00DC057A mov edx,edi //load vControlA as second parameter
00DC057C call dword ptr ds:[0BA4F0Ch] //call Point3ControlA.Dot
00DC0582 fstp st(0) //pop the FPU stack (the result is discarded)
00DC0584 inc dword ptr [ebp-10h] //i++
00DC0587 cmp dword ptr [ebp-10h],989680h //does i == n?
00DC058E jl 00DC0578 //if not, jump to the beginning of the loop
Afterthoughts:
The JIT compiler for some reason did not use a register for i, so it incremented an integer on the stack (ebp-10h) instead. As a result, this test has the poorest performance.
Moving on to the second test:
for (int i = 0; i < n; i++)
{
    acc += Point3ControlB.Dot(vControlB, vControlB);
}
Generated assembly:
00DC0612 xor edi,edi //i = 0
00DC0614 mov ecx,esi //load vControlB as the first argument
00DC0616 mov edx,esi //load vControlB as the second argument
00DC0618 call dword ptr ds:[0BA4FD4h] // call Point3ControlB.Dot
00DC061E fstp st(0) //pop the FPU stack (the result is discarded)
00DC0620 inc edi //++i
00DC0621 cmp edi,989680h //does i == n
00DC0627 jl 00DC0614 //if not, jump to the beginning of the loop
Afterthoughts: this generated assembly is almost identical to the first one, but this time the JIT did use a register for i, hence the minor performance boost over the first test.
Moving on to the test in question:
for (int i = 0; i < n; i++)
{
    acc += Point3ControlC.Dot(vControlC, vControlC);
}
And for the generated assembly:
00DC06A7 xor eax,eax //i = 0
00DC06A9 inc eax //++i
00DC06AA cmp eax,989680h //does i == n ?
00DC06AF jl 00DC06A9 //if not, jump to the beginning of the loop
As we can see, the JIT has completely optimized away the call to Point3ControlC.Dot, so you actually pay only for the loop, not for the call itself. Hence this "test" finishes first, as it didn't do much to begin with.
Can we say something about structs vs. classes from this test alone? Well, no. I'm still not quite sure why the compiler decided to optimize out the call to the struct function while preserving the other calls. What I am sure about is that in real-life code, the compiler cannot optimize the call away if the result is used. In this mini-benchmark we don't do much with the result, and even if we did, the compiler could calculate the result at compile time. So the compiler can be more aggressive here than it could be in real-life code.
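One way to make such a micro-benchmark harder to optimize away is to vary the input per iteration and to consume the accumulated result. The following is a minimal sketch under that assumption; `Vec3`, `ConsumeResultDemo` and `SumDots` are hypothetical names, and `Vec3` merely mirrors the struct from the question. It does not guarantee what any particular JIT version will emit.

```csharp
using System;
using System.Diagnostics;

// Hypothetical struct mirroring Point3ControlC from the question.
public struct Vec3
{
    public double X, Y, Z;
    public Vec3(double x, double y, double z) { X = x; Y = y; Z = z; }

    public static double Dot(Vec3 a, Vec3 b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public static class ConsumeResultDemo
{
    // Returns the accumulated sum so that the caller can consume it.
    public static double SumDots(Vec3[] points, int n)
    {
        double acc = 0;
        for (int i = 0; i < n; i++)
        {
            // Picking a different operand each iteration keeps the dot product
            // from being a loop invariant.
            Vec3 p = points[i % points.Length];
            acc += Vec3.Dot(p, p);
        }
        return acc;
    }

    public static void Main()
    {
        var points = new[] { new Vec3(1, 2, 3), new Vec3(4, 5, 6) };
        var sw = Stopwatch.StartNew();
        double acc = SumDots(points, 10000000);
        sw.Stop();
        // Printing acc makes the result observable, so the work is not dead code.
        Console.WriteLine("struct " + sw.ElapsedMilliseconds + " ms, acc = " + acc);
    }
}
```

The same loop can be repeated for the class variants to get a comparison in which all three tests actually perform the calculation.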