lock cmpxchg/inc/dec is strangely slow

Question

I have been testing Windows SRW lock performance and found a strange issue. I have the following test loop:

for (int i = 0; i < 100000000; ++i)
{
    AcquireSRWLockShared (&g_srwLock);
    ReleaseSRWLockShared (&g_srwLock);
}

It takes about 1.5s when I run it on a single thread and about 2.9s (per thread) when I run it on two threads at the same time. OK, then I have the following loop:

for (int i = 0; i < 100000000; ++i)
{
    _InterlockedIncrement (&g_state);
    _InterlockedDecrement (&g_state);
}

It takes about 1.1s when I run it on a single thread and about 5.6s (!!!, per thread) when I run it on two threads. What am I doing wrong?

I dug into the AcquireSRWLockShared code and found that it uses lock cmpxchg, so I tried my loop with it:

for (int i = 0; i < 100000000; ++i)
{
    _InterlockedCompareExchange (&g_state, 0, 0);
    _InterlockedCompareExchange (&g_state, 0, 0);
}

and got exactly the same result: about 5.6s for two threads. OK, then I copied the exact code of AcquireSRWLockShared:

__declspec (naked) void __stdcall TestLock (volatile long *address)
{
    __asm
    {
        mov         edi,edi  
        push        ebp  
        mov         ebp,esp  
        push        esi  
        mov         esi,dword ptr [ebp+8]  
        push        11h  
        xor         ecx,ecx  
        mov         edx,esi  
        pop         eax  
        lock cmpxchg dword ptr [edx],ecx  
        mov         ecx,eax  
        cmp         ecx,11h  
        //jne         77685820  
        pop         esi  
        pop         ebp  
        ret         4  
    }
}

(I had to comment out the jump since it goes to some other code), and again got 5.6s for two threads. So, what is wrong? Why does the same code take 2.9s when run from the library and 5.6s when run from my function?

My PC is an i5-3570K @ 4.4GHz with 16 GB of DDR3 RAM @ 1600MHz.

Answer 1

Consider how a locked increment works at the microcode level: the core grabs a lock on the cache line containing the variable, pulls the value into a register, increments it, and sticks it back into memory. The side effect is that if another thread running on another core has the same cache line in its local cache, that copy must be evicted (i.e. discarded) and then reloaded from memory (or at least from a higher cache level).
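The read-modify-write sequence described above can be sketched in portable C++. This is only an illustration, not the question's MSVC code: std::atomic<long>::fetch_add and fetch_sub stand in for _InterlockedIncrement and _InterlockedDecrement (x86 compilers typically emit lock-prefixed instructions for them), and hammer_shared_counter is a hypothetical helper name.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: two threads each perform `iterations`
// increment/decrement pairs on ONE shared counter, as in the question.
long hammer_shared_counter(int iterations)
{
    std::atomic<long> state{0};

    auto worker = [&state, iterations] {
        for (int i = 0; i < iterations; ++i)
        {
            state.fetch_add(1);  // atomic read-modify-write
            state.fetch_sub(1);  // atomic read-modify-write
        }
    };

    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();

    // Every increment is paired with a decrement, so the counter ends at 0.
    return state.load();
}
```

Calling hammer_shared_counter(100000000) reproduces the shape of the two-thread test; because both threads write to the same variable, each operation has to take ownership of the same cache line.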

Locked compare-exchange is much the same, except that the modification is conditional. It still requires grabbing the lock on the cache line and pulling the value into a register. If the compare succeeds, it ends up being pretty much the same as a locked increment. But if the compare fails, the cache line is not modified and hence it is not evicted from the other core's cache.
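To make the "conditional modification" concrete, here is a small sketch of the logical behaviour using std::atomic's compare_exchange_strong as a stand-in for _InterlockedCompareExchange. It shows only what value ends up stored, not what the cache does. CasResult and try_cas are hypothetical names introduced for the illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical result type for the illustration.
struct CasResult
{
    bool swapped;   // did the compare succeed?
    long observed;  // value seen in the variable at the time of the compare
    long stored;    // value in the variable afterwards
};

// Start with `initial`, then try to replace `expected` with `desired`.
CasResult try_cas(long initial, long expected, long desired)
{
    std::atomic<long> state{initial};

    long observed = expected;
    // On success the variable becomes `desired` and `observed` is unchanged.
    // On failure the variable is left alone and `observed` receives
    // the value that was actually there.
    bool swapped = state.compare_exchange_strong(observed, desired);

    return CasResult{swapped, observed, state.load()};
}
```

For example, try_cas(0, 0, 0) mirrors the question's _InterlockedCompareExchange (&g_state, 0, 0): the compare succeeds every time, so it behaves like an unconditional locked write.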

In the first example (AcquireSRWLockShared followed by ReleaseSRWLockShared), the lock variable is modified twice in each iteration, but by a single thread (single core). Although the other thread takes a bus lock as it attempts to acquire the lock while it's held, its compare fails, so no modification is made and the cache line is not evicted from the core holding the lock. Eventually, when that thread releases the lock, the cache line is evicted and reloaded by the other core, but that only happens once per acquire/release pair.

In the other examples, you have two threads both trying (and presumably mostly succeeding) to take a lock on the cache line and modify it, which evicts the line from the other core's cache. And you're doing it twice in each loop iteration, since the increment/decrement always succeeds (as does the compare-exchange with [0, 0]), so you're evicting the cache line twice as often.
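One way to check this explanation on your own machine is to time the same loop once with both threads on one counter and once with each thread on its own cache line. The following is a rough sketch, assuming 64-byte cache lines (true for the i5-3570K, but an assumption elsewhere). PaddedCounter and time_two_threads are hypothetical names, and std::atomic again stands in for the Interlocked intrinsics.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Assumption: 64-byte cache lines, so each PaddedCounter gets its own line.
struct alignas(64) PaddedCounter
{
    std::atomic<long> value{0};
};

// Runs the increment/decrement loop on two threads and returns the
// elapsed wall-clock time in seconds. Pass the same counter twice to
// make the threads contend, or two different counters to keep them apart.
double time_two_threads(std::atomic<long>& first,
                        std::atomic<long>& second,
                        int iterations)
{
    auto worker = [iterations](std::atomic<long>& counter) {
        for (int i = 0; i < iterations; ++i)
        {
            counter.fetch_add(1);
            counter.fetch_sub(1);
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::thread a(worker, std::ref(first));
    std::thread b(worker, std::ref(second));
    a.join();
    b.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return elapsed.count();
}
```

With PaddedCounter counters[2], compare time_two_threads(counters[0].value, counters[0].value, n) against time_two_threads(counters[0].value, counters[1].value, n). If the cache line ping-pong described above is the cause, the second call should be noticeably faster, though the exact ratio will depend on the hardware.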
