Poor C# optimizer performance?

I've just written a small example to check how C#'s optimizer behaves with indexers. The example is simple: I wrap an array in a class and fill its values twice, once directly and once through an indexer (which internally accesses the data exactly the same way as the direct version does).

    using System;
    using System.Diagnostics;

    public class ArrayWrapper
    {
        public ArrayWrapper(int newWidth, int newHeight)
        {
            width = newWidth;
            height = newHeight;

            data = new int[width * height];
        }

        public int this[int x, int y]
        {
            get
            {
                return data[y * width + x];
            }
            set
            {
                data[y * width + x] = value;
            }
        }

        public readonly int width, height;
        public readonly int[] data;
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            ArrayWrapper bigArray = new ArrayWrapper(15000, 15000);

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            for (int y = 0; y < bigArray.height; y++)
                for (int x = 0; x < bigArray.width; x++)
                    bigArray.data[y * bigArray.width + x] = 12;
            stopwatch.Stop();

            Console.WriteLine(String.Format("Directly: {0} ms", stopwatch.ElapsedMilliseconds));

            stopwatch.Restart();
            for (int y = 0; y < bigArray.height; y++)
                for (int x = 0; x < bigArray.width; x++)
                    bigArray[x, y] = 12;
            stopwatch.Stop();

            Console.WriteLine(String.Format("Via indexer: {0} ms", stopwatch.ElapsedMilliseconds));

            Console.ReadKey();
        }
    }

Many SO posts have taught me that a programmer should trust the optimizer to do its job. But in this case the results are quite surprising:

Directly: 1282 ms
Via indexer: 2134 ms

(Compiled in Release configuration with optimizations on; I double-checked.)

That's a huge difference, far too large to be a statistical error (and it's both scalable and repeatable).

It's a very unpleasant surprise: in this case I'd expect the compiler to inline the indexer (it doesn't even include any range checking of its own), but it didn't. Here's the disassembly (note that my comments are guesses about what is going on):

Direct

                    bigArray.data[y * bigArray.width + x] = 12;
000000a2  mov         eax,dword ptr [ebp-3Ch]  // Evaluate index of array
000000a5  mov         eax,dword ptr [eax+4] 
000000a8  mov         edx,dword ptr [ebp-3Ch] 
000000ab  mov         edx,dword ptr [edx+8] 
000000ae  imul        edx,dword ptr [ebp-10h]  
000000b2  add         edx,dword ptr [ebp-14h]  // ...until here
000000b5  cmp         edx,dword ptr [eax+4]    // Range checking
000000b8  jb          000000BF 
000000ba  call        6ED23CF5                 // Throw IndexOutOfRange
000000bf  mov         dword ptr [eax+edx*4+8],0Ch // Assign value to array

By indexer

                    bigArray[x, y] = 12;
0000015e  push        dword ptr [ebp-18h]       // Push x and y
00000161  push        0Ch                       // (prepare parameters)
00000163  mov         ecx,dword ptr [ebp-3Ch] 
00000166  mov         edx,dword ptr [ebp-1Ch] 
00000169  cmp         dword ptr [ecx],ecx 
0000016b  call        dword ptr ds:[004B27DCh]  // Call the indexer

(...)

                data[y * width + x] = value;
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  sub         esp,8 
00000006  mov         dword ptr [ebp-8],ecx 
00000009  mov         dword ptr [ebp-4],edx 
0000000c  cmp         dword ptr ds:[004B171Ch],0 // Some additional checking, I guess?
00000013  je          0000001A 
00000015  call        6ED24648                   
0000001a  mov         eax,dword ptr [ebp-8]        // Evaluating index
0000001d  mov         eax,dword ptr [eax+4] 
00000020  mov         edx,dword ptr [ebp-8] 
00000023  mov         edx,dword ptr [edx+8] 
00000026  imul        edx,dword ptr [ebp+0Ch] 
0000002a  add         edx,dword ptr [ebp-4]        // ...until here
0000002d  cmp         edx,dword ptr [eax+4]        // Range checking
00000030  jb          00000037 
00000032  call        6ED23A5D                     // Throw IndexOutOfRange exception
00000037  mov         ecx,dword ptr [ebp+8] 
0000003a  mov         dword ptr [eax+edx*4+8],ecx  // Actual assignment
                }
0000003e  nop 
0000003f  mov         esp,ebp 
00000041  pop         ebp 
00000042  ret         8                            // Returning

That's a total disaster (in terms of code optimization). So my questions are:

  • Why was this (actually quite simple) code not optimized properly?
  • How can I modify this code so that it is optimized the way I want (if possible)?
  • Can a programmer rely on C#'s optimizer as much as on C++'s?

OK, I know that the last one is hard to answer. But lately I have read many questions about C++ performance and was amazed at how much the optimizer can do (for example, fully inlining std::tie, two std::tuple constructors and an overloaded operator < on the fly).


Edit: (in response to comments)

It seems that this was my fault after all, because I measured the performance while running the program from the IDE. Now I ran the same program outside the IDE and attached the debugger to it on the fly. Now I get:

Direct

                    bigArray.data[y * bigArray.width + x] = 12;
000000ae  mov         eax,dword ptr [ebp-10h] 
000000b1  imul        eax,edx 
000000b4  add         eax,ebx 
000000b6  cmp         eax,edi 
000000b8  jae         000001FA 
000000be  mov         dword ptr [ecx+eax*4+8],0Ch 

Indexer

                    bigArray[x, y] = 12;
0000016b  mov         eax,dword ptr [ebp-14h] 
0000016e  imul        eax,edx 
00000171  add         eax,ebx 
00000173  cmp         eax,edi 
00000175  jae         000001FA 
0000017b  mov         dword ptr [ecx+eax*4+8],0Ch 

These listings are exactly the same (in terms of CPU instructions). When run, the indexer version achieved even better results than the direct one, but only (I guess) because of caching. After putting the tests inside a loop, everything went back to normal:

Directly: 573 ms
Via indexer: 353 ms
Directly: 356 ms
Via indexer: 362 ms
Directly: 351 ms
Via indexer: 370 ms
Directly: 351 ms
Via indexer: 354 ms
Directly: 359 ms
Via indexer: 356 ms

Well, lesson learned. Even when compiling in Release mode, it makes a huge difference whether the program is run from the IDE or standalone. Thanks @harold for the idea.

Running code with the debugger attached from the start makes the JIT generate slow code (unless you disable "Suppress JIT optimization on module load", but that makes debugging a little hard). The procedure I use to view the optimized assembly is to throw an exception (conditionally, say, if a static variable is 0, so the optimizer doesn't get too trigger-happy) and attach the debugger when the program crashes. You'll probably have to go through the "Manually choose debuggers" route. Also, make sure you enable "Show external code" (from the context menu on the call stack).

The code I got for the direct access was:
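
The conditional throw described above might be sketched as follows. This is only an illustration under assumptions: the names CrashToAttach, crashTrigger and Run are hypothetical, and the array is kept small because only the generated code matters here, not the timing.

```csharp
using System;

public sealed class ArrayWrapper
{
    public readonly int width, height;
    public readonly int[] data;

    public ArrayWrapper(int newWidth, int newHeight)
    {
        width = newWidth;
        height = newHeight;
        data = new int[width * height];
    }

    public int this[int x, int y]
    {
        get { return data[y * width + x]; }
        set { data[y * width + x] = value; }
    }
}

public static class CrashToAttach
{
    // Hypothetical trigger. It is deliberately not readonly, so the JIT has to
    // load it at run time and cannot treat the throw below as dead code.
    public static int crashTrigger = 0;

    public static ArrayWrapper Run(int width, int height)
    {
        ArrayWrapper bigArray = new ArrayWrapper(width, height);

        // The loop whose optimized machine code we want to inspect.
        for (int y = 0; y < bigArray.height; y++)
            for (int x = 0; x < bigArray.width; x++)
                bigArray[x, y] = 12;

        // Crash on purpose; attach the debugger when the crash dialog appears
        // and look at the disassembly of this method.
        if (crashTrigger == 0)
            throw new InvalidOperationException("Intentional crash: attach the debugger now.");

        return bigArray;
    }

    public static void Main()
    {
        Run(1000, 1000);
    }
}
```

Because the method has already been compiled before any debugger is attached, the disassembly seen after attaching should be the optimized code rather than the debugger-friendly version.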

innerloop:
  mov  eax,dword ptr [esi+8]   ; bigArray.width
  imul eax,ebx                 ; * y
  add  eax,edi                 ; + x
  mov  edx,dword ptr [ebp-14h] ; pointer to bigArray.data
  mov  ecx,dword ptr [ebp-10h] ; \
  cmp  eax,ecx                 ; |  bounds check
  jae  00000087                ; /
  mov  dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
  inc  edi                     ; x++
  cmp  edi,dword ptr [esi+8]   ; \
  jl   innerloop               ; / if (x < bigArray.width) goto innerloop

And for the indexer:

innerloop:
  mov  eax,dword ptr [esi+8]   ; bigArray.width
  imul eax,ebx                 ; * y
  add  eax,edi                 ; + x
  mov  edx,dword ptr [ebp-14h] ; pointer to bigArray.data
  mov  ecx,dword ptr [ebp-10h] ; \
  cmp  eax,ecx                 ; |  bounds check
  jae  00000087                ; /
  mov  dword ptr [edx+eax*4+8],0Ch ; data[index] = 12
  inc  edi                     ; x++
  cmp  edi,dword ptr [esi+8]   ; \
  jl   innerloop               ; / if (x < bigArray.width) goto innerloop

This is not a paste mistake; the code for the inner loop really was exactly the same.
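
The listings above show that the JIT inlined this indexer without any help. For an accessor that does not get inlined on its own, one option (not part of the original posts, and assuming .NET Framework 4.5 or later) is to request inlining explicitly with MethodImplOptions.AggressiveInlining. The class name HintedArrayWrapper is hypothetical, and the attribute is only a hint: the JIT may still decline, and it does not help when optimizations are suppressed by an attached debugger.

```csharp
using System;
using System.Runtime.CompilerServices;

public sealed class HintedArrayWrapper
{
    public readonly int width, height;
    public readonly int[] data;

    public HintedArrayWrapper(int newWidth, int newHeight)
    {
        width = newWidth;
        height = newHeight;
        data = new int[width * height];
    }

    public int this[int x, int y]
    {
        // The attribute is applied to each accessor, since accessors are methods.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get { return data[y * width + x]; }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        set { data[y * width + x] = value; }
    }
}

public static class InliningHintDemo
{
    public static void Main()
    {
        HintedArrayWrapper a = new HintedArrayWrapper(4, 3);
        a[1, 2] = 12;
        // Reads the same element through the raw array; prints 12.
        Console.WriteLine(a.data[2 * a.width + 1]);
    }
}
```

Whether the hint changes anything is best checked the same way as above, by inspecting the optimized disassembly rather than assuming.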
