C#( Mono ) managed vs unmanaged arrays: benchmark favours managed?

Benchmarking the following code surprisingly gives better results for managed arrays (about 10% faster, consistently). I'm testing in Unity, so maybe it is related to Mono?

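using System;                          // IntPtr
using System.Diagnostics;              // Stopwatch
using System.Runtime.InteropServices;  // Marshal

unsafe void Bench()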
{
    //Locals
    int i, j;
    const int   bufSize         = 1024 * 1024;
    const int   numIterations   = 1000;
    const float gain            = 1.6745f;

    float[] managedBuffer;

    IntPtr  ptr;
    float * unmanagedBuffer;

    Stopwatch stopwatch; 

    // Allocations
    managedBuffer = new float[ bufSize ];
    for( i = 0; i < bufSize; i++ )
    {
        managedBuffer[ i ] = UnityEngine.Random.value;
    }

    ptr             = Marshal.AllocHGlobal( bufSize * sizeof( float ) );
    unmanagedBuffer = ( float * )ptr.ToPointer();

    Marshal.Copy( managedBuffer, 0, ptr, bufSize );

    stopwatch = new Stopwatch();
    stopwatch.Start();

    // Unmanaged array iterations
    for( i = 0; i < numIterations; i++ )
    {
        for( j = 0; j < bufSize; j++ )
        {
            unmanagedBuffer[ j ] *= gain;
        }
    }

    UnityEngine.Debug.Log( stopwatch.ElapsedMilliseconds );

    stopwatch.Reset();
    stopwatch.Start();

    // Managed array iterations
    for( i = 0; i < numIterations; i++ )
    {
        for( j = 0; j < bufSize; j++ )
        {
            managedBuffer[ j ] *= gain;
        }
    }

    UnityEngine.Debug.Log( stopwatch.ElapsedMilliseconds );

    Marshal.FreeHGlobal( ptr );
}

I'm experimenting with unsafe code for an audio application which is quite performance critical. I'm hoping to increase performance and decrease / eliminate garbage collection.

Any insights appreciated!

Not really an answer, but this needs more space than a comment.

If you use ILSpy to inspect the IL code, the difference between the two inner loops is as follows (release build, default settings, my PC: Windows 7 64-bit):

        // unmanaged        
        IL_005a: ldloc.s unmanagedBuffer
        IL_005c: ldloc.1
        IL_005d: conv.i
        IL_005e: ldc.i4.4
        IL_005f: mul
        IL_0060: add
        IL_0061: dup
        IL_0062: ldind.r4
        IL_0063: ldc.r4 1.6745
        IL_0068: mul
        IL_0069: stind.r4
        IL_006a: ldloc.1
        IL_006b: ldc.i4.1
        IL_006c: add
        IL_006d: stloc.1

        // managed
        IL_00a4: ldloc.2
        IL_00a5: ldloc.1
        IL_00a6: ldelema [mscorlib]System.Single
        IL_00ab: dup
        IL_00ac: ldobj [mscorlib]System.Single
        IL_00b1: ldc.r4 1.6745
        IL_00b6: mul
        IL_00b7: stobj [mscorlib]System.Single
        IL_00bc: ldloc.1
        IL_00bd: ldc.i4.1
        IL_00be: add
        IL_00bf: stloc.1

I don't know how much machine code corresponds to each IL instruction, but it might be an optimization problem (see how much work is required to calculate the index in the unmanaged case: conv.i, ldc.i4.4, mul, add on every iteration, versus a single ldelema for the managed array).
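To make the index arithmetic point concrete, here is a minimal sketch of the unmanaged loop rewritten to walk a pointer instead of recomputing `base + j * 4` on every iteration. The class and method names (`PointerWalk`, `Scale`, `Demo`) are made up for illustration and are not from the question. Whether this is actually faster under Mono or the Microsoft JIT is an assumption that would need to be measured.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PointerWalk
{
    // Multiplies `count` floats starting at `buffer` by `gain`, in place.
    // The pointer is advanced directly, so no per-element index
    // multiplication is needed in the source loop.
    public static unsafe void Scale(float* buffer, int count, float gain)
    {
        float* p = buffer;
        float* end = buffer + count;
        while (p < end)
        {
            *p *= gain;
            p++;
        }
    }

    // Small self-check: fills an unmanaged buffer with 1, 2, 3, 4,
    // doubles it, and copies the result back into a managed array.
    public static unsafe float[] Demo()
    {
        const int count = 4;
        IntPtr mem = Marshal.AllocHGlobal(count * sizeof(float));
        try
        {
            float* buf = (float*)mem.ToPointer();
            for (int i = 0; i < count; i++)
            {
                buf[i] = i + 1;
            }

            Scale(buf, count, 2.0f);

            float[] result = new float[count];
            Marshal.Copy(mem, result, 0, count);
            return result;
        }
        finally
        {
            // Always release the unmanaged block, even if something throws.
            Marshal.FreeHGlobal(mem);
        }
    }
}
```

This needs to be compiled with unsafe code enabled, just like the original `Bench` method. A good optimizing JIT may already turn the indexed loop into something equivalent, so treat this as something to benchmark rather than a guaranteed win.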


I noticed a non-linear correlation between the number of iterations and the time:

The first column is numIterations, the second is the unmanaged time (ms), and the last one is the managed time (ms).

Up to 170 iterations the growth is linear, and then something starts happening (regardless of the increment: the screenshot uses 10, and with 5 it is also fine up to 170). This bugs me, and I'd really like to get a real answer here.
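One thing worth checking about the 170 threshold: the benchmark multiplies every sample by 1.6745 on each pass and never resets the buffer, so values that start near 1.0 grow like 1.6745^n and must eventually exceed float.MaxValue. The sketch below works out when that happens. The names (`OverflowEstimate`, `EstimatedPasses`, `FirstOverflowPass`) are hypothetical helpers, not anything from the thread.

```csharp
using System;

public static class OverflowEstimate
{
    // Real-valued n that solves start * gain^n = float.MaxValue.
    // Assumes start > 0 and gain > 1.
    public static double EstimatedPasses(double start, double gain)
    {
        return Math.Log(float.MaxValue / start) / Math.Log(gain);
    }

    // Repeats the benchmark's multiply in single precision and returns
    // the first pass on which the value becomes infinite, or -1 if that
    // does not happen within maxPasses.
    public static int FirstOverflowPass(float start, float gain, int maxPasses)
    {
        float x = start;
        for (int pass = 1; pass <= maxPasses; pass++)
        {
            // Explicit cast forces the result back to single precision.
            x = (float)(x * gain);
            if (float.IsInfinity(x))
            {
                return pass;
            }
        }
        return -1;
    }
}
```

For a sample starting at 1.0, ln(float.MaxValue) / ln(1.6745) is roughly 88.72 / 0.5155, or about 172.1, so the value first overflows to infinity on pass 173; smaller starting values take only a few passes longer. That lines up suspiciously well with the point where the timings stop being linear. Whether multiplying infinities is actually slower depends on the JIT and the floating-point unit in use, and nothing in the thread confirms it, so this is a hypothesis to test (for example by rerunning with a gain of 1.0f) rather than an established cause.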

Not an answer, but I need the space. Using C# and VS13, I saw different assembly generated for the multiplication.

UNMANAGED

00007FFC013555D9  movsxd      rcx,dword ptr [rbp+0D8h]  
00007FFC013555E0  mov         rax,qword ptr [rbp+0C0h]  
00007FFC013555E7  lea         rax,[rax+rcx*4]  
00007FFC013555EB  mov         qword ptr [rbp+50h],rax  
00007FFC013555EF  mov         rax,qword ptr [rbp+50h]  
00007FFC013555F3  movss       xmm0,dword ptr [7FFC013558A0h]  
00007FFC013555FB  mulss       xmm0,dword ptr [rax]  
00007FFC013555FF  mov         rax,qword ptr [rbp+50h]  
00007FFC01355603  movss       dword ptr [rax],xmm0 

MANAGED

00007FFC01355722  movsxd      rcx,dword ptr [rbp+0D8h]  
00007FFC01355729  mov         rax,qword ptr [rbp+0D0h]  
00007FFC01355730  mov         rax,qword ptr [rax+8]  
00007FFC01355734  mov         qword ptr [rbp+78h],rcx  
00007FFC01355738  cmp         qword ptr [rbp+78h],rax  
00007FFC0135573C  jae         00007FFC01355748  
00007FFC0135573E  mov         rax,qword ptr [rbp+78h]  
00007FFC01355742  mov         qword ptr [rbp+78h],rax  
00007FFC01355746  jmp         00007FFC0135574D  
00007FFC01355748  call        00007FFC60E86590  
00007FFC0135574D  mov         rcx,qword ptr [rbp+0D0h]  
00007FFC01355754  mov         rax,qword ptr [rbp+78h]  
00007FFC01355758  lea         rax,[rcx+rax*4+10h]  
00007FFC0135575D  mov         qword ptr [rbp+80h],rax  
00007FFC01355764  mov         rax,qword ptr [rbp+80h]  
00007FFC0135576B  movss       xmm0,dword ptr [7FFC013558A0h]  
00007FFC01355773  mulss       xmm0,dword ptr [rax]  
00007FFC01355777  mov         rax,qword ptr [rbp+80h]  
00007FFC0135577E  movss       dword ptr [rax],xmm0 

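The managed version is noticeably longer: it loads the array length and range-checks the index (the cmp / jae / call sequence) before computing the element address. Obviously, the bigger the code, the slower the execution...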
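If the range check in the managed listing is the concern, a middle ground is to keep the managed array but pin it with `fixed` and index through the resulting pointer. This is only a sketch under that assumption; `PinnedScale` and `Scale` are illustrative names, and whether it beats plain array indexing on a given runtime (Mono in particular) would have to be measured.

```csharp
public static class PinnedScale
{
    // Multiplies every element of `samples` by `gain`, in place.
    // The array stays a normal managed float[]; it is only pinned for
    // the duration of the loop so the GC cannot move it.
    public static unsafe void Scale(float[] samples, float gain)
    {
        if (samples == null || samples.Length == 0)
        {
            return;
        }

        fixed (float* p = samples)
        {
            int count = samples.Length;
            for (int j = 0; j < count; j++)
            {
                // Pointer indexing is not range-checked, so the loop
                // bound above is the only protection.
                p[j] *= gain;
            }
        }
    }
}
```

Usage would be something like `PinnedScale.Scale(managedBuffer, gain);`. Compared with `Marshal.AllocHGlobal`, this avoids a separate unmanaged allocation and the matching `FreeHGlobal`, at the cost of pinning the array while the loop runs.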

