How can multiple threads running on a single core have a data race?

Question

I have the following simple C++ source:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define CNTNUM 100000000
int iglbcnt = 0 ;
int iThreadDone = 0 ;

void *thread1(void *param)
{
    /*
    pid_t tid = syscall(SYS_gettid);
    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( 5, &set );
    if (sched_setaffinity( tid, sizeof( cpu_set_t ), &set ))
    {
        printf( "sched_setaffinity error" );
    }
    */
    pthread_detach(pthread_self());
    for(int idx=0;idx<CNTNUM;idx++)
        iglbcnt++ ;
    printf(" thread1 out \n") ;
    __sync_add_and_fetch(&iThreadDone,1) ;
    return NULL ;
}

int main(int argc, char **argv)
{
    pthread_t tid ;
    pthread_create(&tid , NULL, thread1, (void*)(long)1);
    pthread_create(&tid , NULL, thread1, (void*)(long)3);
    pthread_create(&tid , NULL, thread1, (void*)(long)5);
    while( 1 ){
        sleep( 2 ) ;
        if( iThreadDone >= 3 )
            printf("iglbcnt=(%d) \n",iglbcnt) ;
    }
}

If I run it, the result is certainly not guaranteed to be 300000000 unless the source uses __sync_add_and_fetch(&iglbcnt, 1) instead of iglbcnt++.

Then I run it as numactl -C 5 ./x.exe. numactl pins all three thread1 instances to core 5, so in theory only one of the three can be running on core 5 at any time. Since iglbcnt is a global variable shared by all of them, I expected the result to be 300000000. Unfortunately it is not always 300000000; sometimes it comes out as something like 292065873.

My guess at why it is not always 300000000: during a context switch on core 5, the value of iglbcnt is still held in the CPU's store buffer, so when the scheduler runs another thread, the value of iglbcnt in the L1 cache differs from the value in core 5's store buffer, and that produces 292065873 instead of 300000000.

This is only an experiment. As I said, __sync_add_and_fetch solves the problem, but I would still like to know in detail what causes this result.

Edit:

Both ++iglbcnt and iglbcnt++ produce the same code.

Compiling with g++ --std=c++11 -S -masm=intel x.cpp (source using ++iglbcnt), the following code comes from x.s:

.LFB11:
    .cfi_startproc
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    mov     rbp, rsp
    .cfi_def_cfa_register 6
    sub     rsp, 32
    mov     QWORD PTR [rbp-24], rdi
    call    pthread_self
    mov     rdi, rax
    call    pthread_detach
    mov     DWORD PTR [rbp-4], 0
    jmp     .L2
.L3:
    mov     eax, DWORD PTR iglbcnt[rip]
    add     eax, 1
    mov     DWORD PTR iglbcnt[rip], eax
    add     DWORD PTR [rbp-4], 1
.L2:
    cmp     DWORD PTR [rbp-4], 99999999
    jle     .L3
    mov     edi, OFFSET FLAT:.LC0
    call    puts
    lock add        DWORD PTR iThreadDone[rip], 1
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE11:
    .size   _Z7thread1Pv, .-_Z7thread1Pv
    .section        .rodata
.LC1:
    .string "iglbcnt=(%d) \n"
    .text

Edit 2:

for(int idx=0;idx<CNTNUM;idx++){
    asm volatile("":::"memory") ;
    iglbcnt++ ;
}

Compiling this with -O1 works fine; adding a compile-time memory barrier helps in this case.

Answer 1

iglbcnt++ is a load, add, store sequence. It is performed without synchronization, so the threads (even when scheduled on the same core) race with each other, because each of them has its own register context. Using __sync_add_and_fetch on iglbcnt resolves the race.

The load into a core register takes place, then the thread is switched out (its registers are saved). Another thread reads the same value, increments it, and stores it back to memory (perhaps hundreds of increments). Then the first thread is switched back in with its stale value, which is incremented and stored, potentially losing thousands to millions of increments (as you have seen).
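The interleaving described above can be replayed deterministically in a single thread. This is only an illustrative sketch: the function name replay_lost_update and the variables shared, reg_a and reg_b are made up for the example, with shared standing in for iglbcnt and the two reg variables standing in for each thread's saved copy of eax.

```cpp
#include <cassert>

// Hypothetical single-threaded replay of a lost update.
// `shared` plays the role of iglbcnt; `reg_a` and `reg_b` play the role of
// the per-thread register (eax) that the kernel saves and restores.
int replay_lost_update(int increments_by_b)
{
    assert(increments_by_b >= 0);
    int shared = 0;

    int reg_a = shared;          // thread A: load iglbcnt into its register
    // (thread A is preempted here; its registers are saved)

    for (int i = 0; i < increments_by_b; ++i) {
        int reg_b = shared;      // thread B: load
        reg_b += 1;              // thread B: add
        shared = reg_b;          // thread B: store
    }

    // (thread A is switched back in with its stale register value)
    reg_a += 1;                  // thread A: add to the stale value
    shared = reg_a;              // thread A: store, overwriting B's work

    return shared;
}
```

Calling replay_lost_update(1000) returns 1 rather than 1001: all of B's increments are overwritten by A's single stale store, which is the same kind of loss seen in the 292065873 result.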

Answer 2

Threads running on one processor can have a data race if they are preemptively scheduled, meaning that an interrupt can occur at any moment and trigger a thread context switch. Threads then have to use mutual exclusion mechanisms such as mutex objects, or else atomic instructions (together with a carefully designed algorithm).
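As a rough sketch of the two options just mentioned, here is the same counting loop written once with a mutex and once with an atomic increment. It uses the C++11 std::thread, std::mutex and std::atomic facilities instead of the pthread calls and the __sync_add_and_fetch builtin from the question; the function names count_with_mutex and count_with_atomic and their parameters are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Each of `threads` workers adds 1 to a shared counter `per_thread` times.
// The mutex makes the whole load/add/store sequence mutually exclusive.
long count_with_mutex(int threads, int per_thread)
{
    assert(threads >= 0 && per_thread >= 0);
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++counter;
            }
        });
    }
    for (auto &th : pool) th.join();
    return counter;
}

// Same loop, but the increment is a single atomic read-modify-write,
// comparable to the `lock add` seen for iThreadDone in the assembly.
long count_with_atomic(int threads, int per_thread)
{
    assert(threads >= 0 && per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto &th : pool) th.join();
    return counter.load();
}
```

With three workers, both functions return exactly 3 * per_thread regardless of how the threads are scheduled or which cores they run on. The atomic version is usually the cheaper choice for a bare counter; the mutex becomes necessary once more than one variable has to change together.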

Cooperatively scheduled threads on a single processor avoid data races implicitly. Under cooperative threading on a single processor, one thread executes until it explicitly calls some function that switches context. Any code that doesn't call such a function is free from interference from other threads.
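To make the cooperative case concrete, here is a toy round-robin scheduler, written as a sketch and not modeled on any real threading library. The name run_cooperative and the representation of a task as a callable that returns true while it has work left are assumptions made for the example. Returning from the callable stands in for the explicit context-switch call, so the load, add and store always complete together.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical toy scheduler: a task runs one slice of work per call and
// returns true while it still has work. Returning is the only point where
// another task can run, so it plays the role of an explicit yield.
int run_cooperative(int tasks, int per_task)
{
    assert(tasks >= 0 && per_task >= 0);
    int counter = 0;
    std::vector<int> remaining(tasks, per_task);
    std::vector<std::function<bool()>> ready;

    for (int t = 0; t < tasks; ++t) {
        ready.push_back([&counter, &remaining, t] {
            if (remaining[t] == 0) return false;
            int reg = counter;        // load
            reg += 1;                 // add
            counter = reg;            // store
            --remaining[t];
            return remaining[t] > 0;  // "yield" happens only on return
        });
    }

    // Round-robin over the tasks until none of them has work left.
    bool any_left = true;
    while (any_left) {
        any_left = false;
        for (auto &task : ready)
            if (task()) any_left = true;
    }
    return counter;
}
```

Because no switch can happen between the load and the store, run_cooperative(3, 1000) returns 3000 even though the increment is the same unsynchronized three-step sequence as in the question.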


Answer 1
2 accepted 2016-08-03 00:48:31

Answer 2
0 2016-08-03 19:13:00
