VC++: performance drops 20x when there are more threads than CPUs, but not under g++

Question

A simple multithreaded C++11 program in which all threads lock the same mutex in a tight loop.

When it uses 8 threads (the number of logical CPUs), it reaches 5 million locks/second.

But add just one more thread and the performance drops to 200,000/second!

Edit:

Under g++ 4.8.2 (Ubuntu x64) there is no performance degradation at all, even with 100 threads (and more than twice the performance, but that's another story). So this does seem to be a problem specific to the VC++ mutex implementation.

I reproduced it with the following code (Windows 7 x64):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono;

// Each worker takes the shared mutex and bumps the counter, forever.
void thread_loop(std::mutex* mutex, std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        std::unique_lock<std::mutex> ul(*mutex);
        ++(*counter);
    }
}

int main()
{
    int threads = 9;  // one more than the 8 logical CPUs
    std::mutex mutex;
    std::atomic<std::uint64_t> counter(0);

    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        new std::thread(&thread_loop, &mutex, &counter);  // never joined: the workers run forever

    std::cout << "Started " << threads << " threads.." << std::endl;
    while (true)
    {
        counter = 0;
        std::this_thread::sleep_for(seconds(1));
        // Number of lock acquisitions during the last second
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

The VS 2013 profiler tells me that most of the time (95.7%) is wasted in a tight loop (line 697 in rtlocks.cpp):

while (IsBlocked() && spinWait._SpinOnce())
{
//_YieldProcessor is called inside _SpinOnce
}

What could be the cause? How can this be improved?

OS: Windows 7 x64

CPU: i7 3770, 4 cores (x2 hyper-threading)

Answer 1

With 8 threads your code is spinning, but it gets the lock without the thread having to be suspended before it loses its time slice.

As you add more and more threads the contention level increases, and with it the chance that a thread will not be able to acquire the lock within its time slice. When this happens the thread is suspended and a context switch occurs to another thread, which the CPU will examine to see if the thread can be woken up.

All this switching, suspending and waking up requires a transition from user mode to kernel mode, which is an expensive operation, so performance is significantly impacted.

To improve things, either reduce the number of threads contending for the lock or increase the number of cores available. In your example you're using a std::atomic number, so you don't need to lock in order to call ++ on it, since it's already thread safe.
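
To illustrate the last point, here is a minimal sketch (not taken from the question) that runs the same increment workload for a fixed number of iterations, once on a std::atomic with no lock and once on a plain integer guarded by a std::mutex. The helper names count_lock_free and count_with_mutex are made up for this example, and the threads are joined so that the functions return.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: every worker increments a shared atomic counter
// `iterations` times without taking any lock.
std::uint64_t count_lock_free(unsigned threads, std::uint64_t iterations)
{
    assert(threads > 0);
    std::atomic<std::uint64_t> counter(0);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threads; ++i)
        workers.emplace_back([&counter, iterations] {
            for (std::uint64_t n = 0; n < iterations; ++n)
                ++counter;  // atomic read-modify-write, no mutex needed
        });
    for (auto& w : workers)
        w.join();
    return counter.load();
}

// Hypothetical helper: same workload, but a plain integer guarded by a mutex.
std::uint64_t count_with_mutex(unsigned threads, std::uint64_t iterations)
{
    assert(threads > 0);
    std::mutex mutex;
    std::uint64_t counter = 0;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threads; ++i)
        workers.emplace_back([&mutex, &counter, iterations] {
            for (std::uint64_t n = 0; n < iterations; ++n)
            {
                std::lock_guard<std::mutex> lock(mutex);
                ++counter;  // protected by the mutex
            }
        });
    for (auto& w : workers)
        w.join();
    return counter;
}
```

Both functions return threads * iterations. Wrapping each call in std::chrono::steady_clock timestamps would show how much the mutex costs on a particular machine and compiler.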

Answer 2

The mutex causes contention between the threads anyway. However, if you try to use more threads than you have cores, then even if they are all ready, not all of them can run at once, so they have to keep stopping and starting. This is known as context switching.

The only way you can "solve" this is to use fewer threads or get more cores.
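
Using fewer threads can be sketched as clamping the requested count to what the hardware reports. clamp_thread_count is a hypothetical helper, not something from the question. Note that std::thread::hardware_concurrency() is only a hint and may return 0, in which case this sketch simply keeps the requested value.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: never start more workers than there are
// hardware threads, as far as the runtime can tell.
unsigned clamp_thread_count(unsigned requested)
{
    assert(requested > 0);
    const unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        return requested;  // count unknown: assumption is to leave it unchanged
    return std::min(requested, hw);
}
```

In the question's program this would replace the hard-coded value, for example int threads = clamp_thread_count(9);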

Answer 3

Your problem is that 8 threads store to a shared resource (stores, not loads: loading a shared resource that cannot be modified is safe and needs no lock).

8 threads > number of cores means:
- not every thread can have a CPU to itself
- more task scheduling takes place
Mutex:
- a thread that cannot acquire the mutex goes to sleep and is placed on a wait queue. (It seems the mutex implementation on Windows spins briefly, then puts the thread on the wait queue if it still has not acquired the mutex?)
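
The spin-then-wait behaviour guessed at above can be illustrated with a simplified spin-then-yield lock. This is not the VC++ or Windows implementation: a real mutex would block the thread in the kernel, whereas this sketch only calls std::this_thread::yield() after an assumed spin limit. SpinThenYieldLock, kSpinLimit and count_with_spin_lock are names invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Simplified illustration only: spin a bounded number of times,
// then give up the rest of the time slice and try again.
class SpinThenYieldLock
{
public:
    void lock()
    {
        int spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire))
        {
            if (++spins >= kSpinLimit)
            {
                std::this_thread::yield();  // let another thread run
                spins = 0;
            }
        }
    }

    void unlock() { flag_.clear(std::memory_order_release); }

private:
    static constexpr int kSpinLimit = 1000;  // assumed tuning value
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical helper: increments a plain counter under the lock above.
std::uint64_t count_with_spin_lock(unsigned threads, std::uint64_t iterations)
{
    assert(threads > 0);
    SpinThenYieldLock lock;
    std::uint64_t counter = 0;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threads; ++i)
        workers.emplace_back([&lock, &counter, iterations] {
            for (std::uint64_t n = 0; n < iterations; ++n)
            {
                std::lock_guard<SpinThenYieldLock> guard(lock);
                ++counter;
            }
        });
    for (auto& w : workers)
        w.join();
    return counter;
}
```

With more runnable threads than cores, the thread holding the lock can be descheduled while the others burn their time slices spinning, which is one plausible reading of the profiler output in the question.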

Writing a lock-free algorithm is hard, but for your problem there is a way.

- If you can get more cores, get them.
- Use std::atomic<uint64_t> and delete the mutex; incrementing an atomic number is atomic by default (no special memory ordering has to be specified).
- If the number of threads is not fixed, change it to the number of cores, and then bind the threads to the cores.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

using namespace std::chrono;

// No mutex: the atomic increment is already thread safe.
void thread_loop(std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        ++(*counter);
    }
}

int main()
{
    int threads = 9;
    std::atomic<std::uint64_t> counter(0);

    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        new std::thread(&thread_loop, &counter);  // never joined: the workers run forever

    std::cout << "Started " << threads << " threads.." << std::endl;
    while (true)
    {
        std::this_thread::sleep_for(seconds(1));
        // Running total: unlike the question's code, the counter is not reset each second
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

This may be faster. Enjoy ;-)


Answer 1
8 2014-01-21 10:32:56

Answer 2
5 2014-01-21 10:29:30

Answer 3
1 2014-01-21 12:35:44
