
Horrible scalability of std::mutex with zero contention

Even with zero contention, the scalability of std::mutex seems to be horrible. This is a case where every thread is guaranteed to use its own mutex. What is going on?

#include <mutex>
#include <thread>
#include <chrono>
#include <vector>
#include <numeric>
#include <cstdio>

void TestThread(bool *pbFinished, int* pResult)
{
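    // Note: *pbFinished is a plain bool that the main thread writes while this
    // loop reads it, and *pResult points into a std::vector<int> shared by all
    // threads, so each thread's counter sits next to the others' in memory.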
    std::mutex mtx;
    for (; !*pbFinished; (*pResult)++)
    {
        mtx.lock();
        mtx.unlock();
    }
}

void Test(int coreCnt)
{
    const int ms = 3000;
    bool bFinished = false;
    std::vector<int> results(coreCnt);    
    std::vector<std::thread*> threads(coreCnt);

    for (int i = 0; i < coreCnt; i++)
        threads[i] = new std::thread(TestThread, &bFinished, &results[i]);

    std::this_thread::sleep_for(std::chrono::milliseconds(ms));

    bFinished = true;
    for (std::thread* pThread: threads)
        pThread->join();

    int sum = std::accumulate(results.begin(), results.end(), 0);
    printf("%d cores: %.03fm ops/sec\n", coreCnt, double(sum)/double(ms)/1000.);
}

int main(int argc, char** argv)
{
    for (int cores = 1; cores <= (int)std::thread::hardware_concurrency(); cores++)
        Test(cores);

    return 0;
}

The results on Windows are terrible:

1 cores: 15.696m ops/sec
2 cores: 12.294m ops/sec
3 cores: 17.134m ops/sec
4 cores: 9.680m ops/sec
5 cores: 13.012m ops/sec
6 cores: 21.142m ops/sec
7 cores: 18.936m ops/sec
8 cores: 18.507m ops/sec

The drop on Linux is even bigger:

1 cores: 46.525m ops/sec
2 cores: 15.089m ops/sec
3 cores: 15.105m ops/sec
4 cores: 14.822m ops/sec
5 cores: 14.519m ops/sec
6 cores: 14.544m ops/sec
7 cores: 13.996m ops/sec
8 cores: 13.869m ops/sec

I have also tried tbb's reader/writer lock, and I even rolled my own locks.
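For reference, a hand-rolled lock of that kind might look roughly like the sketch below, built on std::atomic_flag (the post does not show the actual lock it used, so this is only an illustration):

#include <atomic>

// A trivial test-and-set spin lock, shown only as a sketch of what a
// hand-rolled lock might look like; it spins in user space and never
// makes a system call.
struct SpinLock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire))
            ; // busy-wait until the previous owner clears the flag
    }

    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};

Swapping such a lock in for std::mutex in the loops above is one way to separate the cost of the lock itself from the cost of how the test measures it.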

I made my own variant of the test with the following changes:

  • Each test thread runs a fixed number of iterations rather than running for a fixed amount of time. Each thread returns how long it took to run its iterations. (For the tests I used 20 million iterations.)

  • The main thread that orchestrates the test waits for each thread to signal that it is ready to start. Then, once it sees that all threads are "ready", it signals "go" to all of them. These signals are basically condition_variables. This essentially eliminates the performance noise of one thread starting its measured work while another is still warming up a kernel thread.

  • Threads do not touch any global data until they exit and return their results.

  • When all threads have finished, the total rate is computed from the number of iterations and the time each thread spent.

  • A high-resolution clock is used to measure the time spent in each thread.

#include <mutex>
#include <condition_variable>
#include <chrono>
#include <thread>
#include <vector>
#include <iostream>

struct TestSignal
{
    std::mutex mut;
    std::condition_variable cv;
    bool isReady;

    TestSignal() : isReady(false)
    {

    }

    void Signal()
    {
        mut.lock();
        isReady = true;
        mut.unlock();
        cv.notify_all();
    }

    void Wait()
    {
        std::unique_lock<std::mutex> lck(mut);
        cv.wait(lck, [this] {return isReady; });
    }
};

long long TestThread2(long long iterations, TestSignal& signalReady, TestSignal& signalGo)
{
    std::mutex mtx;

    signalReady.Signal(); // signal to the main thread we're ready to proceed
    signalGo.Wait();      // wait for the main thread to tell us to start

    auto start = std::chrono::high_resolution_clock::now();

    for (long long i = 0; i < iterations; i++)
    {
        mtx.lock();
        mtx.unlock();
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto milli = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    return milli.count(); // return how long it took to execute the iterations
}


void Test2(unsigned int threadcount)
{

    long long iterations = 20000000; // 20 million

    std::vector<std::thread> threads(threadcount);
    std::vector<TestSignal> readySignals(threadcount);
    std::vector<long long> results(threadcount);

    TestSignal signalGo;

    for (unsigned int i = 0; i < threadcount; i++)
    {
        auto t = std::thread([&results, &readySignals, &signalGo, i, iterations] {results[i] = TestThread2(iterations, readySignals[i], signalGo); });
        readySignals[i].Wait();
        threads[i] = std::move(t);
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    signalGo.Signal(); // unleash the threads

    for (unsigned int i = 0; i < threadcount; i++)
    {
        threads[i].join();
    }

    double totalrate = 0;
    for (unsigned int i = 0; i < threadcount; i++)
    {
        double rate = iterations / (double)(results[i]); // operations per millisecond
        totalrate += rate;
    }

    std::cout << threadcount << " threads: " << totalrate/1000 << "m ops/sec (new test)\n";
    
}

Then a simple main compares the two results, 3 times each:

#ifdef WIN32
#include <windows.h>
#endif

int main()
{
#ifdef WIN32
    ::SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
#endif

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());

    Test(std::thread::hardware_concurrency());
    Test2(std::thread::hardware_concurrency());


    return 0;
}

The results are dramatically different:

12 cores: 66.343m ops/sec
12 threads: 482.187m ops/sec (new test)
12 cores: 111.061m ops/sec
12 threads: 474.199m ops/sec (new test)
12 cores: 66.758m ops/sec
12 threads: 481.353m ops/sec (new test)
