
Why is unboxing 100 times faster than boxing?

Why is there such a large speed difference between boxing and unboxing operations? I measured roughly a 10x difference. When should we care about this? Last week, Azure support told us there is an issue in the heap memory of our application. I am curious to know whether it could be related to boxing and unboxing.

using System;
using System.Diagnostics;

namespace ConsoleBoxing
{
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Program started");
        var elapsed = Boxing();
        Unboxing(elapsed);
        Console.WriteLine("Program ended");
        Console.Read();
    }

    private static void Unboxing(double boxingtime)
    {
        Stopwatch s = new Stopwatch();
        s.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int a = 33;      // value stored on the stack
            object b = a;    // boxing: the value is copied into a new heap object
            int c = (int)b;  // unboxing happens only here: the value is copied from the heap back to the stack
        }
        s.Stop();

        var UnBoxing = s.Elapsed.TotalMilliseconds - boxingtime;
        Console.WriteLine("UnBoxing time : " + UnBoxing);
    }

    private static double Boxing()
    {
        Stopwatch s = new Stopwatch();
        s.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int a = 33;
            object b = a;
        }
        s.Stop();
        var elapsed = s.Elapsed.TotalMilliseconds;
        Console.WriteLine("Boxing time : " + elapsed);
        return elapsed;
    }
}
}

Think of unboxing as a single memory load instruction from the boxed object to a register, perhaps with a bit of surrounding address calculation and cast validation logic. A boxed object is like a class with one field of the boxed type. How expensive can those operations be? Not very, especially since the L1 cache hit rate in your benchmark is ~100%.
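To make the "class with one field" picture concrete, here is a rough sketch. BoxedInt, Box, and Unbox are hypothetical names invented for illustration; the CLR does not expose its real box layout this way, and a real unbox also checks that the object has the expected type.

```csharp
using System;

// Hypothetical stand-in for the object the runtime creates when it boxes an int.
sealed class BoxedInt
{
    public readonly int Value;
    public BoxedInt(int value) { Value = value; }
}

static class BoxModel
{
    // "Boxing": allocate a new heap object and copy the value into it.
    public static BoxedInt Box(int value)
    {
        return new BoxedInt(value);
    }

    // "Unboxing": follow the reference and load the single field.
    public static int Unbox(BoxedInt box)
    {
        return box.Value;
    }

    static void Main()
    {
        BoxedInt b = Box(33);
        int c = Unbox(b);
        Console.WriteLine(c); // prints 33
    }
}
```

Seen this way, the asymmetry is visible in the source: Box contains a new, Unbox contains only a field read.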

Boxing involves allocating a new object and garbage-collecting it later. In your code the GC probably triggers on the allocation in 99% of the cases.
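If you want to see that allocation cost rather than take it on faith, something like the following sketch may help. It assumes a runtime where GC.GetAllocatedBytesForCurrentThread is available (.NET Core 2.0 or later, or .NET Framework 4.8). The Sink field and the BoxingAllocations class are names made up for this example.

```csharp
using System;

static class BoxingAllocations
{
    // Storing each box in a static field lets it escape the method,
    // so the JIT cannot drop the allocation as dead code.
    static object Sink;

    public static long MeasureAllocatedBytes(int iterations)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            Sink = i; // boxing: one new heap object per iteration
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static void Main()
    {
        int gen0Before = GC.CollectionCount(0);
        long bytes = MeasureAllocatedBytes(1000000);
        int gen0After = GC.CollectionCount(0);

        Console.WriteLine("Allocated bytes   : " + bytes);
        Console.WriteLine("Gen 0 collections : " + (gen0After - gen0Before));
    }
}
```

The exact numbers depend on the runtime and platform, so none are shown here. For the heap issue reported by Azure support, counters like these are a cheap first check, but they only show that boxing allocates, not that boxing is the cause of the problem.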

That said, your benchmark is invalid because the loops have no side effects. It is probably luck that the current JIT cannot optimize them away. Have the loop compute a result and funnel it into GC.KeepAlive to make the result appear used. Also, you might be running in Debug mode.
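A minimal sketch of that suggestion might look like this. The class and method names are invented for the example, and there is no guarantee that a given JIT will not still hoist or simplify parts of the loops, so inspecting the generated code remains worthwhile.

```csharp
using System;
using System.Diagnostics;

static class KeepAliveBenchmark
{
    const int Iterations = 1000000;

    // Boxes a different value on every iteration and returns the last box.
    public static object BoxLoop()
    {
        object last = null;
        for (int i = 0; i < Iterations; i++)
        {
            last = i;
        }
        return last;
    }

    // Unboxes on every iteration and accumulates the values into a result.
    public static long UnboxLoop(object boxed)
    {
        long sum = 0;
        for (int i = 0; i < Iterations; i++)
        {
            sum += (int)boxed;
        }
        return sum;
    }

    static void Main()
    {
        // Warm-up calls so that JIT compilation is not part of the measurement.
        GC.KeepAlive(BoxLoop());
        GC.KeepAlive(UnboxLoop(33));

        Stopwatch s = Stopwatch.StartNew();
        object boxed = BoxLoop();
        s.Stop();
        GC.KeepAlive(boxed); // make the result appear used
        Console.WriteLine("Boxing time : " + s.Elapsed.TotalMilliseconds);

        s.Restart();
        long sum = UnboxLoop(33);
        s.Stop();
        GC.KeepAlive(sum); // make the result appear used
        Console.WriteLine("UnBoxing time : " + s.Elapsed.TotalMilliseconds);
    }
}
```

Build in Release mode and run without a debugger attached; otherwise the numbers mostly reflect unoptimized code.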

People have already offered fantastic explanations of why unboxing is faster than boxing, but I want to say a little more about the methodology you used to test the performance difference.

Did you get your result (a 10x difference in speed) from the code you posted? If I run that program in Release mode, here is the output:

Program started
Boxing time : 0.2741
UnBoxing time : 4.5847
Program ended

Whenever I run a micro-benchmark, I verify that I am actually comparing the operations I intended to compare. The compiler can optimize your code. Open the executable in ILDASM:

Here is the IL for UnBoxing (I only included the portion that matters most):

IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005:  stloc.0
IL_0006:  ldloc.0 
IL_0007:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c:  ldc.i4.0
IL_000d:  stloc.1
IL_000e:  br.s       IL_0025
IL_0010:  ldc.i4.s   33
IL_0012:  stloc.2
IL_0013:  ldloc.2
IL_0014:  box        [mscorlib]System.Int32    //Here is the boxing
IL_0019:  stloc.3
IL_001a:  ldloc.3
IL_001b:  unbox.any  [mscorlib]System.Int32    //Here is the unboxing
IL_0020:  pop
IL_0021:  ldloc.1
IL_0022:  ldc.i4.1
IL_0023:  add
IL_0024:  stloc.1
IL_0025:  ldloc.1
IL_0026:  ldc.i4     0xf4240
IL_002b:  blt.s      IL_0010
IL_002d:  ldloc.0
IL_002e:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

And this is the IL for Boxing:

IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005:  stloc.0
IL_0006:  ldloc.0
IL_0007:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c:  ldc.i4.0
IL_000d:  stloc.1
IL_000e:  br.s       IL_0017
IL_0010:  ldc.i4.s   33
IL_0012:  stloc.2
IL_0013:  ldloc.1
IL_0014:  ldc.i4.1
IL_0015:  add
IL_0016:  stloc.1
IL_0017:  ldloc.1
IL_0018:  ldc.i4     0xf4240
IL_001d:  blt.s      IL_0010
IL_001f:  ldloc.0
IL_0020:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

There is no box instruction at all in the Boxing method. The compiler has removed it completely, so the Boxing method does nothing but iterate an empty loop. The time measured in UnBoxing therefore becomes the total time of boxing and unboxing.

Micro-benchmarking is very vulnerable to compiler tricks. I would suggest you have a look at your IL as well. It may be different if you are using a different compiler.

I modified your test code a little bit:

Boxing method:

private static object Boxing()
{
    Stopwatch s = new Stopwatch();

    int unboxed = 33;
    object boxed = null;

    s.Start();

    for (int i = 0; i < 1000000; i++)
    {
        boxed = unboxed;
    }

    s.Stop();

    var elapsed = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("Boxing time : " + elapsed);

    return boxed;
}

And the Unboxing method:

private static int Unboxing()
{
    Stopwatch s = new Stopwatch();

    object boxed = 33;
    int unboxed = 0;

    s.Start();

    for (int i = 0; i < 1000000; i++)
    {
        unboxed = (int)boxed;
    }

    s.Stop();

    var time = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("UnBoxing time : " + time);

    return unboxed;
}

That way, they are translated into similar IL.

For the Boxing method:

IL_000c:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0011:  ldc.i4.0
IL_0012:  stloc.3
IL_0013:  br.s       IL_0020
IL_0015:  ldloc.1
IL_0016:  box        [mscorlib]System.Int32  //Here is the boxing
IL_001b:  stloc.2
IL_001c:  ldloc.3
IL_001d:  ldc.i4.1
IL_001e:  add
IL_001f:  stloc.3
IL_0020:  ldloc.3
IL_0021:  ldc.i4     0xf4240
IL_0026:  blt.s      IL_0015
IL_0028:  ldloc.0
IL_0029:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

For UnBoxing:

IL_0011:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0016:  ldc.i4.0
IL_0017:  stloc.3
IL_0018:  br.s       IL_0025
IL_001a:  ldloc.1
IL_001b:  unbox.any  [mscorlib]System.Int32  //Here is the unboxing
IL_0020:  stloc.2
IL_0021:  ldloc.3
IL_0022:  ldc.i4.1
IL_0023:  add
IL_0024:  stloc.3
IL_0025:  ldloc.3
IL_0026:  ldc.i4     0xf4240
IL_002b:  blt.s      IL_001a
IL_002d:  ldloc.0
IL_002e:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

Run several iterations to remove the cold-start effect:

static void Main(string[] args)
{
    Console.WriteLine("Program started");
    for (int i = 0; i < 10; i++)
    {
        Boxing();
        Unboxing();
    }
    Console.WriteLine("Program ended");
    Console.Read();
}

Here is the output:

Program started
Boxing time : 3.4814
UnBoxing time : 0.1712
Boxing time : 2.6294
...
Boxing time : 2.4842
UnBoxing time : 0.1712
Program ended

Does that prove that unboxing is 10x faster than boxing? Let's check the assembly code with WinDbg:

0:004> !u 000007fe93b83940
Normal JIT generated code
MicroBenchmarks.Program.Boxing()
...
000007fe`93ca01b3 call    System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
...
//This is the for loop
000007fe`93ca01c2 mov     eax,21h
000007fe`93ca01c7 mov     dword ptr [rsp+20h],eax
000007fe`93ca01cb lea     rdx,[rsp+20h]
000007fe`93ca01d0 lea     rcx,[mscorlib_ni+0x6e92b0 (000007fe`f18b92b0)]
//here is the boxing
000007fe`93ca01d7 call    clr!JIT_BoxFastMP_InlineGetThread (000007fe`f33126d0)   
000007fe`93ca01dc mov     rsi,rax
//loop unrolling. instead of increment i by 1, we are actually incrementing i by 4
000007fe`93ca01df add     edi,4                 
000007fe`93ca01e2 cmp     edi,0F4240h           // 0F4240h = 1000000
000007fe`93ca01e8 jl      000007fe`93ca01c2     // jumps to the line "mov eax,21h"
//end of the for loop
000007fe`93ca01ea mov     rcx,rbx
000007fe`93ca01ed call    System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)

The assembly for UnBoxing:

0:004> !u 000007fe93b83930
Normal JIT generated code
MicroBenchmarks.Program.Unboxing()
Begin 000007fe93ca02c0, size 117
000007fe`93ca02c0 push    rbx
...
000007fe`93ca030a call    System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
000007fe`93ca030f mov     qword ptr [rbx+10h],rax
000007fe`93ca0313 mov     byte ptr [rbx+18h],1
000007fe`93ca0317 xor     eax,eax
000007fe`93ca0319 mov     edi,dword ptr [rdi+8]
000007fe`93ca031c nop     dword ptr [rax]
//This is the for loop
//again, loop unrolling
000007fe`93ca0320 add     eax,4
000007fe`93ca0323 cmp     eax,0F4240h    // 0F4240h = 1000000
000007fe`93ca0328 jl      000007fe`93ca0320  //jumps to "add eax,4"
//end of the for loop
000007fe`93ca032a mov     rcx,rbx
000007fe`93ca032d call    System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)

You can see that even if the comparison seems reasonable at the IL level, the JIT can still perform further optimization at run time. The UnBoxing method is running an empty loop again. Until you verify that the code executed for the two methods is comparable, it is very hard to simply conclude that "unboxing is 10x faster than boxing".

Because boxing involves objects, and unboxing involves primitives. The entire purpose of primitives in an OOP language is to improve performance, so it should not be surprising that they succeed.

Boxing creates a new object on the heap, much like array initialisation:

int[] arr = {10, 20, 30};

Boxing provides a convenient initialization syntax, so you don't have to use the new operator explicitly, but an instantiation is still taking place.

Unboxing is much cheaper: follow the reference to the boxed value, and retrieve the value.
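As an illustration of what "retrieve the value" means, the sketch below shows that a box holds its own copy of the value and that unboxing only succeeds for the exact boxed type. UnboxSemantics and UnboxToLongThrows are names made up for this example.

```csharp
using System;

static class UnboxSemantics
{
    // Returns true if unboxing the given object as long fails the runtime type check.
    public static bool UnboxToLongThrows(object boxed)
    {
        try
        {
            long wrong = (long)boxed; // unboxing requires the exact boxed type
            GC.KeepAlive(wrong);
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    static void Main()
    {
        int a = 33;
        object b = a;   // boxing copies the value into a new heap object
        a = 44;         // changing the original does not affect the box
        int c = (int)b; // unboxing copies the value back out

        Console.WriteLine(c);                    // prints 33
        Console.WriteLine(UnboxToLongThrows(b)); // prints True
    }
}
```

The type check is the "cast validation" part of unboxing; apart from that, the operation is just a copy of the value out of the object.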

Boxing has all the overhead of creating a reference type object on the heap.

Unboxing only has the overhead of indirection.

Consider this: for boxing you must allocate memory; for unboxing you need not. Unboxing is a trivial operation, especially in your case, where nothing even happens to the result.

Boxing and unboxing are computationally expensive processes. When a value type is boxed, an entirely new object must be created. This can take up to 20 times longer than a simple reference assignment. When unboxing, the casting process can take four times as long as an assignment.

Why unboxing is 100 time faster than boxing

When you box a value type, a new object has to be created and the value has to be copied into it. When unboxing, only the value has to be copied from the boxed instance. So boxing adds the creation of an object. This, however, is really fast in .NET, so the difference is probably not very large. Try to avoid boxing in the first place if you need maximum speed. Remember that boxing creates objects that need to be cleaned up by the garbage collector.
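One common way to avoid boxing altogether is to use generic collections instead of object-based ones. The following sketch contrasts ArrayList, which stores object references and therefore boxes every int, with List<int>, which stores the values directly. The class and method names are invented for the example.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class AvoidBoxing
{
    // ArrayList holds object references, so each int is boxed on Add
    // and unboxed again when it is read back.
    public static long SumArrayList(int count)
    {
        ArrayList list = new ArrayList();
        for (int i = 0; i < count; i++)
        {
            list.Add(i); // boxing
        }
        long sum = 0;
        foreach (object item in list)
        {
            sum += (int)item; // unboxing
        }
        return sum;
    }

    // List<int> stores the ints themselves, so no boxing takes place.
    public static long SumGenericList(int count)
    {
        List<int> list = new List<int>();
        for (int i = 0; i < count; i++)
        {
            list.Add(i);
        }
        long sum = 0;
        foreach (int item in list)
        {
            sum += item;
        }
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumArrayList(1000));   // prints 499500
        Console.WriteLine(SumGenericList(1000)); // prints 499500
    }
}
```

Both methods compute the same result; only the first creates one short-lived heap object per element for the garbage collector to clean up.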

One of the things that can make a program slow is moving data in and out of memory. If you want speed, avoid memory accesses that are not necessary.

If you look up what boxing and unboxing do, you will see that the difference is that boxing allocates memory on the heap, while unboxing copies a value-type value to the stack. Accessing the stack is faster than accessing the heap, and therefore unboxing is faster in your case.

The stack is faster because the access pattern makes it trivial to allocate and deallocate memory from it (a pointer/integer is simply incremented or decremented), while the heap has much more complex bookkeeping involved in an allocation or free. Also, each byte in the stack tends to be reused very frequently, which means it tends to be mapped to the processor's cache, making it very fast. Another performance hit for the heap is that the heap, being mostly a global resource, typically has to be multi-threading safe, i.e. each allocation and deallocation typically needs to be synchronized with "all" other heap accesses in the program.

I got this information here from SwankyLegg: What and where are the stack and heap?

To see what boxing and unboxing do to memory (stack and heap), you can look it up here: http://msdn.microsoft.com/en-us/library/yz2be5wk.aspx

To keep things simple, use primitive types where you can, and avoid references to memory when possible. If you really want speed, you should look into caching, pre-fetching, blocking, and so on.

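Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.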

 