Poor C# optimizer performance?
I've just written a small example to check how C#'s optimizer handles indexers. The example is simple: I wrap an array in a class and fill its values twice, once directly and once through an indexer (which internally accesses the data exactly the same way as the direct version does).
    using System;
    using System.Diagnostics;

    public class ArrayWrapper
    {
        public ArrayWrapper(int newWidth, int newHeight)
        {
            width = newWidth;
            height = newHeight;
            data = new int[width * height];
        }

        // Maps a 2D coordinate onto the flat array (row-major order).
        public int this[int x, int y]
        {
            get
            {
                return data[y * width + x];
            }
            set
            {
                data[y * width + x] = value;
            }
        }

        public readonly int width, height;
        public readonly int[] data;
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            ArrayWrapper bigArray = new ArrayWrapper(15000, 15000);

            // First pass: write to the underlying array directly.
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            for (int y = 0; y < bigArray.height; y++)
                for (int x = 0; x < bigArray.width; x++)
                    bigArray.data[y * bigArray.width + x] = 12;
            stopwatch.Stop();
            Console.WriteLine(String.Format("Directly: {0} ms", stopwatch.ElapsedMilliseconds));

            // Second pass: the same writes, routed through the indexer.
            stopwatch.Restart();
            for (int y = 0; y < bigArray.height; y++)
                for (int x = 0; x < bigArray.width; x++)
                    bigArray[x, y] = 12;
            stopwatch.Stop();
            Console.WriteLine(String.Format("Via indexer: {0} ms", stopwatch.ElapsedMilliseconds));

            Console.ReadKey();
        }
    }
Many SO posts have taught me that a programmer should trust the optimizer to do its job. But in this case the results are quite surprising:
Directly: 1282 ms
Via indexer: 2134 ms
(Compiled in the Release configuration with optimizations on; I double-checked.)
That's a huge difference, far too large to be statistical noise (and it's both scalable and repeatable).
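One way to double-check the build setting from inside the program is to look at the assembly's DebuggableAttribute. The following is only a sketch: the class name BuildCheck and the method name are made up for illustration. It reports what the compiler recorded at build time and says nothing about whether an attached debugger suppresses optimization at run time.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

public static class BuildCheck
{
    // Returns true if the assembly was compiled with the JIT optimizer disabled
    // (typical for Debug builds). An assembly without the attribute is treated
    // as optimized.
    public static bool IsJitOptimizerDisabled(Assembly assembly)
    {
        var attribute = (DebuggableAttribute)Attribute.GetCustomAttribute(
            assembly, typeof(DebuggableAttribute));
        return attribute != null && attribute.IsJITOptimizerDisabled;
    }
}
```

A call such as BuildCheck.IsJitOptimizerDisabled(typeof(Program).Assembly) at the start of Main could print a warning before any timing is taken.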
It's a very unpleasant surprise: in this case I'd expect the compiler to inline the indexer (it doesn't even include any range checking of its own), but it didn't. Here's the disassembly (note that my comments are guesses about what is going on):
bigArray.data[y * bigArray.width + x] = 12;
000000a2 mov eax,dword ptr [ebp-3Ch] // Evaluate index of array
000000a5 mov eax,dword ptr [eax+4]
000000a8 mov edx,dword ptr [ebp-3Ch]
000000ab mov edx,dword ptr [edx+8]
000000ae imul edx,dword ptr [ebp-10h]
000000b2 add edx,dword ptr [ebp-14h] // ...until here
000000b5 cmp edx,dword ptr [eax+4] // Range checking
000000b8 jb 000000BF
000000ba call 6ED23CF5 // Throw IndexOutOfRange
000000bf mov dword ptr [eax+edx*4+8],0Ch // Assign value to array
bigArray[x, y] = 12;
0000015e push dword ptr [ebp-18h] // Push x and y
00000161 push 0Ch // (prepare parameters)
00000163 mov ecx,dword ptr [ebp-3Ch]
00000166 mov edx,dword ptr [ebp-1Ch]
00000169 cmp dword ptr [ecx],ecx
0000016b call dword ptr ds:[004B27DCh] // Call the indexer
(...)
data[y * width + x] = value;
00000000 push ebp
00000001 mov ebp,esp
00000003 sub esp,8
00000006 mov dword ptr [ebp-8],ecx
00000009 mov dword ptr [ebp-4],edx
0000000c cmp dword ptr ds:[004B171Ch],0 // Some additional checking, I guess?
00000013 je 0000001A
00000015 call 6ED24648
0000001a mov eax,dword ptr [ebp-8] // Evaluating index
0000001d mov eax,dword ptr [eax+4]
00000020 mov edx,dword ptr [ebp-8]
00000023 mov edx,dword ptr [edx+8]
00000026 imul edx,dword ptr [ebp+0Ch]
0000002a add edx,dword ptr [ebp-4] // ...until here
0000002d cmp edx,dword ptr [eax+4] // Range checking
00000030 jb 00000037
00000032 call 6ED23A5D // Throw IndexOutOfRange exception
00000037 mov ecx,dword ptr [ebp+8]
0000003a mov dword ptr [eax+edx*4+8],ecx // Actual assignment
}
0000003e nop
0000003f mov esp,ebp
00000041 pop ebp
00000042 ret 8 // Returning
That's a total disaster (in terms of code optimization). So my questions are:
OK, I know that the last one is hard to answer. But lately I've read many questions about C++ performance and was amazed at how much an optimizer can do (for example, complete inlining of std::tie, two std::tuple ctors and an overloaded operator < on the fly).
Edit: (in response to comments)
It seems that this was actually still my fault, because I measured the performance while running under the IDE. Now I ran the same program outside the IDE and attached the debugger to it on the fly. Now I get:
bigArray.data[y * bigArray.width + x] = 12;
000000ae mov eax,dword ptr [ebp-10h]
000000b1 imul eax,edx
000000b4 add eax,ebx
000000b6 cmp eax,edi
000000b8 jae 000001FA
000000be mov dword ptr [ecx+eax*4+8],0Ch
bigArray[x, y] = 12;
0000016b mov eax,dword ptr [ebp-14h]
0000016e imul eax,edx
00000171 add eax,ebx
00000173 cmp eax,edi
00000175 jae 000001FA
0000017b mov dword ptr [ecx+eax*4+8],0Ch
These two sequences are exactly the same (in terms of CPU instructions). When run, the indexer version even achieved better results than the direct one, but (I guess) only because of caching. After putting the tests inside a loop, everything went back to normal:
Directly: 573 ms
Via indexer: 353 ms
Directly: 356 ms
Via indexer: 362 ms
Directly: 351 ms
Via indexer: 370 ms
Directly: 351 ms
Via indexer: 354 ms
Directly: 359 ms
Via indexer: 356 ms
Well, lesson learned. Even when compiling in Release mode, it makes a huge difference whether the program is run in the IDE or standalone. Thanks @harold for the idea.
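The repeated measurement described above could be sketched roughly as follows. The Bench class and its method names are hypothetical, not part of the original program. Each round is timed separately so that a slow first round (warm-up, page faults on a freshly allocated array) stands out, and a warning is printed if a debugger is attached.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Runs the action `rounds` times and returns the elapsed milliseconds of each round.
    public static long[] Measure(Action action, int rounds)
    {
        var timings = new long[rounds];
        var stopwatch = new Stopwatch();
        for (int i = 0; i < rounds; i++)
        {
            stopwatch.Restart();
            action();
            stopwatch.Stop();
            timings[i] = stopwatch.ElapsedMilliseconds;
        }
        return timings;
    }

    // Prints one line per round, in the same format as the output above.
    public static void Report(string label, Action action, int rounds)
    {
        if (Debugger.IsAttached)
            Console.WriteLine("Warning: debugger attached; timings may reflect unoptimized code.");
        foreach (long ms in Measure(action, rounds))
            Console.WriteLine("{0}: {1} ms", label, ms);
    }
}
```

Wrapping the loops in a delegate may itself change what the JIT generates, so for a faithful comparison it is probably safer to keep the loops inline in Main and simply repeat them, as was done for the numbers above.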
Running code with the debugger attached from the start makes the JIT generate slow code (unless you disable "Suppress JIT optimization on module load", but that makes debugging a little hard). The procedure I use to view the optimized assembly is to throw an exception (conditionally, say, if a static variable is 0, so the optimizer doesn't get too trigger-happy), and attach the debugger when it crashes. You'll probably have to go through the "Manually choose debuggers" route. Also, make sure you enable "Show external code" (from the context menu on the call stack).
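A minimal sketch of the conditional throw described above; the AttachPoint class, the Trigger field and the choice of exception type are assumptions made for illustration only.

```csharp
using System;

public static class AttachPoint
{
    // Hypothetical flag, left at 0 so the throw below fires. Because it is a
    // mutable static field, the JIT cannot assume its value and drop the branch.
    public static int Trigger = 0;

    // Call this after the code of interest has run (and therefore been jitted
    // without a debugger attached), then attach the debugger when the process crashes.
    public static void CrashForDebugger()
    {
        if (Trigger == 0)
            throw new InvalidOperationException(
                "Attach the debugger now to inspect the optimized code.");
    }
}
```

Calling AttachPoint.CrashForDebugger() at the end of Main, after both timing loops, would leave the already-optimized loop code available for inspection in the disassembly window.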
The code I got for the direct access was:
innerloop:
mov eax,dword ptr [esi+8] ; bigArray.width
imul eax,ebx ; * y
add eax,edi ; + x
mov edx,dword ptr [ebp-14h] ; pointer to bigArray.data
mov ecx,dword ptr [ebp-10h] ; \
cmp eax,ecx ; | bounds check
jae 00000087 ; /
mov dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
inc edi ; x++
cmp edi,dword ptr [esi+8] ; \
jl innerloop ; / if (x < bigArray.width) goto innerloop
And for the indexer:
innerloop:
mov eax,dword ptr [esi+8] ; bigArray.width
imul eax,ebx ; * y
add eax,edi ; + x
mov edx,dword ptr [ebp-14h] ; pointer to bigArray.data
mov ecx,dword ptr [ebp-10h] ; \
cmp eax,ecx ; | bounds check
jae 00000087 ; /
mov dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
inc edi ; x++
cmp edi,dword ptr [esi+8] ; \
jl innerloop ; / if (x < bigArray.width) goto innerloop
This is not a paste mistake; the code for the inner loop really was exactly the same.