简体   繁体   中英

Why unboxing is 100 time faster than boxing

Why is there so much speed change between boxing and unboxing operations? There is 10 times difference. When should we care about this? Last week an Azure support told us there is an issue in the heap memory of our application. I am curious to know if it could be related to the boxing-unboxing issue.

using System;
using System.Diagnostics;

namespace ConsoleBoxing
{
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Program started");
        var elapsed = Boxing();
        Unboxing(elapsed);
        Console.WriteLine("Program ended");
        Console.Read();
    }

    private static void Unboxing(double boxingtime)
    {
        Stopwatch s = new Stopwatch();
        s.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int a = 33;//DATA GOES TO STACK
            object b = a;//HEAP IS REFERENCED
            int c = (int)b;//unboxing only hEre ....HEAP GOES TO STACK
        }
        s.Stop();

        var UnBoxing =  s.Elapsed.TotalMilliseconds- boxingtime;
        Console.WriteLine("UnBoxing time : " + UnBoxing);
    }

    private static double Boxing()
    {
        Stopwatch s = new Stopwatch();
        s.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int a = 33;
            object b = a;
        }
        s.Stop();
        var elapsed = s.Elapsed.TotalMilliseconds;
        Console.WriteLine("Boxing time : " + elapsed);
        return elapsed;
    }
}
}

Think of unboxing as a single memory load instruction from the boxed object to a register. Maybe with a bit of surrounding address calculation and cast validation logic. A boxed object is like a class with one field of the boxed type. How expensive can those operations be? Not very, especially since the L1 cache hit rate in your benchmark is ~100%.

Boxing involves allocating a new object and GC'ing it later. In your code the GC probably triggers on the allocation in 99% of the cases.

That said your benchmark is invalid because the loops have no side-effects. It is probably luck that the current JIT cannot optimize them away. Somehow have the loop compute a result and funnel it into GC.KeepAlive to make the result appear used. Also, you might be running Debug mode.

Although people have offered fantastic explanations already for why unboxing is faster than boxing. I want to say a little bit more on the methodology you used to test the performance difference.

Did you get your result (10x difference in speed) from the code you posted? If I run that program in release mode, here is the output:

Program started
Boxing time : 0.2741
UnBoxing time : 4.5847
Program ended

Whenever I am doing a micro performance benchmark, I tend to further verify I am indeed comparing the operation I intended to compare. Compiler can make optimization to your code. Open the executable in ILDASM:

Here is the IL for UnBoxing: (I only included the portion that matters most)

IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005:  stloc.0
IL_0006:  ldloc.0 
IL_0007:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c:  ldc.i4.0
IL_000d:  stloc.1
IL_000e:  br.s       IL_0025
IL_0010:  ldc.i4.s   33
IL_0012:  stloc.2
IL_0013:  ldloc.2
IL_0014:  box        [mscorlib]System.Int32    //Here is the boxing
IL_0019:  stloc.3
IL_001a:  ldloc.3
IL_001b:  unbox.any  [mscorlib]System.Int32    //Here is the unboxing
IL_0020:  pop
IL_0021:  ldloc.1
IL_0022:  ldc.i4.1
IL_0023:  add
IL_0024:  stloc.1
IL_0025:  ldloc.1
IL_0026:  ldc.i4     0xf4240
IL_002b:  blt.s      IL_0010
IL_002d:  ldloc.0
IL_002e:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

And this is the code for Boxing:

IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005:  stloc.0
IL_0006:  ldloc.0
IL_0007:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_000c:  ldc.i4.0
IL_000d:  stloc.1
IL_000e:  br.s       IL_0017
IL_0010:  ldc.i4.s   33
IL_0012:  stloc.2
IL_0013:  ldloc.1
IL_0014:  ldc.i4.1
IL_0015:  add
IL_0016:  stloc.1
IL_0017:  ldloc.1
IL_0018:  ldc.i4     0xf4240
IL_001d:  blt.s      IL_0010
IL_001f:  ldloc.0
IL_0020:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

No boxing instruction at all in the Boxing method . It has been completely removed by compiler. The Boxing method does nothing but iterating an empty loop. The time measured in UnBoxing therefore becomes the total time of boxing and unboxing.

Micro-benchmarking is very vulnerable to compiler tricks. I would suggest you have a look at your IL as well. It may be different if you are using a different compiler.

I modified your test code a little bit:

Boxing method:

private static object Boxing()
{
    Stopwatch s = new Stopwatch();

    int unboxed = 33;
    object boxed = null;

    s.Start();

    for (int i = 0; i < 1000000; i++)
    {
        boxed = unboxed;
    }

    s.Stop();

    var elapsed = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("Boxing time : " + elapsed);

    return boxed;
}

And Unboxing method:

private static int Unboxing()
{
    Stopwatch s = new Stopwatch();

    object boxed = 33;
    int unboxed = 0;

    s.Start();

    for (int i = 0; i < 1000000; i++)
    {
        unboxed = (int)boxed;
    }

    s.Stop();

    var time = s.Elapsed.TotalMilliseconds;
    Console.WriteLine("UnBoxing time : " + time);

    return unboxed;
}

So that they can be translated into similar IL:

For Boxing method:

IL_000c:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0011:  ldc.i4.0
IL_0012:  stloc.3
IL_0013:  br.s       IL_0020
IL_0015:  ldloc.1
IL_0016:  box        [mscorlib]System.Int32  //Here is the boxing
IL_001b:  stloc.2
IL_001c:  ldloc.3
IL_001d:  ldc.i4.1
IL_001e:  add
IL_001f:  stloc.3
IL_0020:  ldloc.3
IL_0021:  ldc.i4     0xf4240
IL_0026:  blt.s      IL_0015
IL_0028:  ldloc.0
IL_0029:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

For UnBoxing:

IL_0011:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
IL_0016:  ldc.i4.0
IL_0017:  stloc.3
IL_0018:  br.s       IL_0025
IL_001a:  ldloc.1
IL_001b:  unbox.any  [mscorlib]System.Int32  //Here is the UnBoxng
IL_0020:  stloc.2
IL_0021:  ldloc.3
IL_0022:  ldc.i4.1
IL_0023:  add
IL_0024:  stloc.3
IL_0025:  ldloc.3
IL_0026:  ldc.i4     0xf4240
IL_002b:  blt.s      IL_001a
IL_002d:  ldloc.0
IL_002e:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

Run several loops to remove the cold startup effect:

static void Main(string[] args)
{
    Console.WriteLine("Program started");
    for (int i = 0; i < 10; i++)
    {
        Boxing();
        Unboxing();
    }
    Console.WriteLine("Program ended");
    Console.Read();
}

Here is the output:

Program started
Boxing time : 3.4814
UnBoxing time : 0.1712
Boxing time : 2.6294
...
Boxing time : 2.4842
UnBoxing time : 0.1712
Program ended

Does that prove that unboxing is 10x faster than boxing? Let's checkout the assembly code with windbg:

0:004> !u 000007fe93b83940
Normal JIT generated code
MicroBenchmarks.Program.Boxing()
...
000007fe`93ca01b3 call    System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
...
//This is the for loop
000007fe`93ca01c2 mov     eax,21h
000007fe`93ca01c7 mov     dword ptr [rsp+20h],eax
000007fe`93ca01cb lea     rdx,[rsp+20h]
000007fe`93ca01d0 lea     rcx,[mscorlib_ni+0x6e92b0 (000007fe`f18b92b0)]
//here is the boxing
000007fe`93ca01d7 call    clr!JIT_BoxFastMP_InlineGetThread (000007fe`f33126d0)   
000007fe`93ca01dc mov     rsi,rax
//loop unrolling. instead of increment i by 1, we are actually incrementing i by 4
000007fe`93ca01df add     edi,4                 
000007fe`93ca01e2 cmp     edi,0F4240h           // 0F4240h = 1000000
000007fe`93ca01e8 jl      000007fe`93ca01c2     // jumps to the line "mov eax,21h"
//end of the for loop
000007fe`93ca01ea mov     rcx,rbx
000007fe`93ca01ed call    System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)

The assembly for UnBoxing:

0:004> !u 000007fe93b83930
Normal JIT generated code
MicroBenchmarks.Program.Unboxing()
Begin 000007fe93ca02c0, size 117
000007fe`93ca02c0 push    rbx
...
000007fe`93ca030a call    System_ni+0x2905e0 (000007fe`f07a05e0) (System.Diagnostics.Stopwatch.GetTimestamp(), mdToken: 00000000060040d2)
000007fe`93ca030f mov     qword ptr [rbx+10h],rax
000007fe`93ca0313 mov     byte ptr [rbx+18h],1
000007fe`93ca0317 xor     eax,eax
000007fe`93ca0319 mov     edi,dword ptr [rdi+8]
000007fe`93ca031c nop     dword ptr [rax]
//This is the for loop
//again, loop unrolling
000007fe`93ca0320 add     eax,4
000007fe`93ca0323 cmp     eax,0F4240h    // 0F4240h = 1000000
000007fe`93ca0328 jl      000007fe`93ca0320  //jumps to "add eax,4"
//end of the for loop
000007fe`93ca032a mov     rcx,rbx
000007fe`93ca032d call    System_ni+0x2acb70 (000007fe`f07bcb70) (System.Diagnostics.Stopwatch.Stop(), mdToken: 00000000060040cb)

You can see that even if at the IL level the comparison seems to be reasonable, JIT can still perform another optimization at runtime. The UnBoxing method is doing am empty loop again. Untill you verify the code executed for the two methods are comparable, it is very hard to simply conclude "unboxing is 10x faster then boxing"

Because boxing involves objects, and unboxing involves primitives. The entire purpose of primitives in an OOP language is to improve performance; so it should not seem surprising that it has succeeded.

Boxing creates a new object on the heap. Like array initialisation:

int[] arr = {10, 20, 30};

boxing provides a convenient initialization syntax, so you don't have to explicitly use the new operator. But in fact there is instantiation going on.

Unboxing is much cheaper: follow the reference to the boxed value, and retrieve the value.

Boxing has all the overhead of creating a reference type object on the heap.

Unboxing only has the overhead of indirection.

Consider this: For boxing you must allocate memory. For unboxing you must not. Given that unboxing is a trivial operation (especailly in your case where even nothing happenx to the result.

Boxing and unboxing are computationally expensive processes. When a value type is boxed, an entirely new object must be created. This can take up to 20 times longer than a simple reference assignment. When unboxing, the casting process can take four times as long as an assignment.

Why unboxing is 100 time faster than boxing

When you box a value type, a new object has to be created and the value has to be copied into the new object. When unboxing, only the value has to be copied from the boxed instance. So boxing adds the creation of an object. This, however, is really fast in .NET, so the difference is probably not very large. Try to avoid the whole boxing procedure in the first place if you need maximum speed. Remember that boxing creates objects that need to be cleaned up by the garbage collector

One of the things that can make a program slow is when you have to move something in and out of memory. Accessing memory should be avoided if it's not necessary (if you want speed).

If I look up what unboxing and boxing does you see that the difference is that boxing allocates memory on the heap and unboxing moves a value-type variable to the stack. Accesing the stack is faster than the heap and therefore unboxing is in your case faster.

The stack is faster because the access pattern makes it trivial to allocate and deallocate memory from it (a pointer/integer is simply incremented or decremented), while the heap has much more complex bookkeeping involved in an allocation or free. Also, each byte in the stack tends to be reused very frequently which means it tends to be mapped to the processor's cache, making it very fast. Another performance hit for the heap is that the heap, being mostly a global resource, typically has to be multi-threading safe, ie each allocation and deallocation needs to be - typically - synchronized with "all" other heap accesses in the program.

I got this information here from SwankyLegg: What and where are the stack and heap?

To see what the difference of unboxing and boxing does to the memory (stack and heap) you can look it up here: http://msdn.microsoft.com/en-us/library/yz2be5wk.aspx

To keep things simple, try to use primitive types where you can and don't make references to memory if you can. If you really want speed you should look into caching, pre-fetching, blocking..

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM