pthread mutex vs atomic ops in Solaris

I was doing some tests with a simple program that measures the performance of an atomic increment of a 64-bit value, comparing atomic_add_64 against a mutex-lock approach. What puzzles me is that the atomic_add is slower than the mutex lock by a factor of 2.

EDIT: I've done some more testing. It looks like atomics are faster than the mutex and scale up to 8 concurrent threads. After that the performance of atomics degrades significantly.

The platform I tested on is:

SunOS 5.10 Generic_141444-09 sun4u sparc SUNW,Sun-Fire-V490

CC: Sun C++ 5.9 SunOS_sparc Patch 124863-03 2008/03/12

The program is quite simple:

#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <atomic.h>

uint64_t        g_Loops = 1000000;
volatile uint64_t       g_Counter = 0;
volatile uint32_t       g_Threads = 20;

pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t  g_Condition;

void LockMutex() 
{ 
  pthread_mutex_lock(&g_Mutex); 
}

void UnlockMutex() 
{ 
   pthread_mutex_unlock(&g_Mutex); 
}

void InitCond()
{
   pthread_mutex_init(&g_CondMutex, 0);
   pthread_cond_init(&g_Condition, 0);
}

void SignalThreadEnded()
{
   pthread_mutex_lock(&g_CondMutex);
   --g_Threads;
   pthread_cond_signal(&g_Condition);
   pthread_mutex_unlock(&g_CondMutex);
}

void* ThreadFuncMutex(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      LockMutex();
      ++g_Counter;
      UnlockMutex();
   }
   SignalThreadEnded();
   return 0;
}

void* ThreadFuncAtomic(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      atomic_add_64(&g_Counter, 1);
   }
   SignalThreadEnded();
   return 0;
}


int main(int argc, char** argv)
{
   pthread_mutex_init(&g_Mutex, 0);
   InitCond();
   bool bMutexRun = true;
   if(argc > 1)
   {
      bMutexRun = false;
      printf("Atomic run!\n");
   }
   else
        printf("Mutex run!\n");

   // start threads
   uint32_t threads = g_Threads;
   while(threads--)
   {
      pthread_t thr;
      if(bMutexRun)
         pthread_create(&thr, 0,ThreadFuncMutex, 0);
      else
         pthread_create(&thr, 0,ThreadFuncAtomic, 0);
   }
   pthread_mutex_lock(&g_CondMutex);
   while(g_Threads)
   {
      pthread_cond_wait(&g_Condition, &g_CondMutex);
      printf("Threads to go %d\n", g_Threads);
   }
   printf("DONE! g_Counter=%ld\n", (long)g_Counter);
}

The results of a test run on our box are:

$ CC -o atomictest atomictest.C
$ time ./atomictest
Mutex run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000

real    0m15.684s
user    0m52.748s
sys     0m0.396s

$ time ./atomictest 1
Atomic run!
Threads to go 19
...
Threads to go 0
DONE! g_Counter=20000000

real    0m24.442s
user    3m14.496s
sys     0m0.068s

Did you run into this type of performance difference on Solaris? Any ideas why this happens?

On Linux the same code (using the gcc __sync_fetch_and_add) yields a 5-fold performance improvement over the mutex version.
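For reference, the Linux variant of the atomic thread function differs only in the increment call. This is a minimal sketch, assuming the same globals and the SignalThreadEnded() helper from the program above; the name ThreadFuncAtomicGcc is purely illustrative:

/* gcc/Linux counterpart of ThreadFuncAtomic: __sync_fetch_and_add
   replaces Solaris' atomic_add_64, everything else is unchanged. */
void* ThreadFuncAtomicGcc(void* arg)
{
   uint64_t counter = g_Loops;
   while(counter--)
   {
      __sync_fetch_and_add(&g_Counter, 1);
   }
   SignalThreadEnded();
   return 0;
}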

Thanks, Octav

You have to be careful about what is happening here.

  1. It takes significant time to create a thread. Thus, it's likely that not all the threads are executing simultaneously. As evidence, I took your code, removed the mutex lock, and got the correct answer every time I ran it. This means that none of the threads were executing at the same time! You should not count the time to create/destroy threads in your test. You should wait until all threads are created and running before you start the measurement (see the start-gate sketch after this list).

  2. Your test isn't fair: it has artificially very high lock contention. For whatever reason, the atomic add_and_fetch suffers in that situation. In real life you would do some work in the thread, and once you add even a little bit of work the atomic ops perform a lot better, because the chance of a race condition drops significantly. The atomic op has lower overhead than the mutex when there is no contention; this is what the Work() function in the modified code below is for.

  3. Number of threads. The fewer threads running, the lower the contention. This is why fewer threads do better for the atomic version in this test. Your figure of 8 threads might simply be the number of simultaneous threads your system supports; then again, it might not be, because your test is so heavily skewed towards contention. I would expect your test to scale up to the number of simultaneously runnable threads and then plateau. One thing I cannot figure out is why, when the number of threads exceeds the number of simultaneous threads the system can handle, we don't see evidence of the mutex being left locked while its holder sleeps. Maybe we do; I just can't see it happening.
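For point 1, one way to keep thread startup out of the measurement is a start gate that every worker blocks on before its timed loop begins. This is only a sketch of the idea using a POSIX barrier (supported on both Linux and Solaris 10); it is not what the modified code below does, which spins on a g_fGo flag instead, and the names here are illustrative:

/* Hypothetical start gate: main() initializes the barrier with the worker
   count, and each worker calls WaitForStart() before its timed loop, so no
   thread starts counting until all of them have been created. */
pthread_barrier_t g_StartBarrier;

void InitStartBarrier(unsigned nThreads)
{
   pthread_barrier_init(&g_StartBarrier, 0, nThreads);
}

void WaitForStart()
{
   pthread_barrier_wait(&g_StartBarrier);   /* returns once all workers arrive */
}

With a gate like this, the measured interval begins only after every thread exists and is runnable, so thread creation and destruction no longer distort the numbers.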

Bottom line: atomics are a lot faster in most real-life situations. They are not very good when you have to hold a lock for a long time, which is something you should avoid anyway (well, in my opinion at least!).

I changed your code so you can test with no work, barely any work, or a little more work, and also change the number of threads.

6sm = 6 threads, barely any work, mutex
6s  = 6 threads, barely any work, atomic

Use a capital S to get more work, and no s at all to get no work.

These results show that with 10 threads, the amount of work affects how much faster atomics are. In the first case there is no work and the atomics are barely faster. Add a little work and the gap doubles to 6 seconds; with a lot of work it gets to almost 10 seconds.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=10000000 13.6520 s
MUTEX  FAST g_Counter=10000000 15.2760 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10s; a.out $t ; a.out "$t"m
ATOMIC slow g_Counter=10000000 11.4957 s
MUTEX  slow g_Counter=10000000 17.9419 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=10S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=10000000 14.7108 s
MUTEX  SLOW g_Counter=10000000 23.8762 s

With 20 threads, atomics are still better, but by a smaller margin. With no work they are almost the same speed; with a lot of work the atomics take the lead again.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=20; a.out $t ; a.out "$t"m
ATOMIC FAST g_Counter=20000000 27.6267 s
MUTEX  FAST g_Counter=20000000 30.5569 s

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=20S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=20000000 35.3514 s
MUTEX  SLOW g_Counter=20000000 48.7594 s

2 threads. Atomics dominate.

(2) /dev_tools/Users/c698174/temp/atomic 
[c698174@shldvgfas007] $ t=2S; a.out $t ; a.out "$t"m
ATOMIC SLOW g_Counter=2000000 0.6007 s
MUTEX  SLOW g_Counter=2000000 1.4966 s

Here is the code (Red Hat Linux, using gcc atomics):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>     /* atoi */
#include <time.h>       /* clock_gettime, struct timespec */
#include <pthread.h>

volatile uint64_t __attribute__((aligned (64))) g_Loops = 1000000 ;
volatile uint64_t __attribute__((aligned (64))) g_Counter = 0;
volatile uint32_t __attribute__((aligned (64))) g_Threads = 7; 
volatile uint32_t __attribute__((aligned (64))) g_Active = 0;
volatile uint32_t __attribute__((aligned (64))) g_fGo = 0;
int g_fSlow = 0;

#define true 1
#define false 0
#define NANOSEC(t) (1000000000ULL * (t).tv_sec + (t).tv_nsec)

pthread_mutex_t g_Mutex;
pthread_mutex_t g_CondMutex;
pthread_cond_t  g_Condition;

void LockMutex() 
{ 
  pthread_mutex_lock(&g_Mutex); 
}

void UnlockMutex() 
{ 
   pthread_mutex_unlock(&g_Mutex); 
}

/* Register the thread as active, spin until the main thread releases all
   workers, then take the per-thread CPU-time start stamp. */
void Start(struct timespec *pT)
{
   __sync_add_and_fetch(&g_Active, 1);
   while(!g_fGo) {}                     /* busy-wait on the start flag */
   clock_gettime(CLOCK_THREAD_CPUTIME_ID, pT);
}

/* Deregister the thread and return its elapsed CPU time in nanoseconds. */
uint64_t End(struct timespec *pT)
{
   struct timespec T;
   __sync_sub_and_fetch(&g_Active, 1);
   clock_gettime(CLOCK_THREAD_CPUTIME_ID, &T);
   return NANOSEC(T) - NANOSEC(*pT);
}
/* A small amount of floating-point busy work, used to lower lock contention. */
void Work(double *x, double z)
{
      *x += z;
      *x /= 27.6;
      if ((uint64_t)(*x + .5) - (uint64_t)*x != 0)
        *x += .7;
}
void* ThreadFuncMutex(void* arg)
{
   struct timespec T;
   uint64_t counter = g_Loops;
   double x = 0, z = 0;
   int fSlow = g_fSlow;

   Start(&T);
   if (!fSlow) {
     while(counter--) {
        LockMutex();
        ++g_Counter;
        UnlockMutex();
     }
   } else {
     while(counter--) {
        if (fSlow==2) Work(&x, z);
        LockMutex();
        ++g_Counter;
        z = g_Counter;
        UnlockMutex();
     }
   }
   *(uint64_t*)arg = End(&T);
   return (void*)(int)x;
}

void* ThreadFuncAtomic(void* arg)
{
   struct timespec T;
   uint64_t counter = g_Loops;
   double x = 0, z = 0;
   int fSlow = g_fSlow;

   Start(&T);
   if (!fSlow) {
     while(counter--) {
        __sync_add_and_fetch(&g_Counter, 1);
     }
   } else {
     while(counter--) {
        if (fSlow==2) Work(&x, z);
        z = __sync_add_and_fetch(&g_Counter, 1);
     }
   }
   *(uint64_t*)arg = End(&T);
   return (void*)(int)x;
}


int main(int argc, char** argv)
{
   int i;
   int bMutexRun;
   pthread_t thr[1000];
   uint64_t aT[1000];

   if (argc < 2) {                      /* guard: argv[1] is required below */
      fprintf(stderr, "usage: %s <threads>[s|S][m]\n", argv[0]);
      return 1;
   }
   bMutexRun = strchr(argv[1], 'm') != NULL;
   g_Threads = atoi(argv[1]);
   g_fSlow = (strchr(argv[1], 's') != NULL) ? 1 : ((strchr(argv[1], 'S') != NULL) ? 2 : 0);

   // start threads
   pthread_mutex_init(&g_Mutex, 0);
   for (i=0 ; i<g_Threads ; ++i)
         pthread_create(&thr[i], 0, (bMutexRun) ? ThreadFuncMutex : ThreadFuncAtomic, &aT[i]);

   // wait
   while (g_Active != g_Threads) {}
   g_fGo = 1;
   while (g_Active != 0) {}

   uint64_t nTot = 0;
   for (i=0 ; i<g_Threads ; ++i)
   { 
        pthread_join(thr[i], NULL);
        nTot += aT[i];
   }
   // done 
   printf("%s %s g_Counter=%llu %2.4lf s\n", (bMutexRun) ? "MUTEX " : "ATOMIC", 
    (g_fSlow == 2) ? "SLOW" : ((g_fSlow == 1) ? "slow" : "FAST"), g_Counter, (double)nTot/1e9);
}
