
Multithreading on Intel much slower than on AMD

I want to parallelize the loop below:

for(int c=0; c<n; ++c) {
    Work(someArray, c);
}

I've done it this way:

#include <future>
#include <thread>
#include <vector>

auto iterationsPerCore = n / numCPU;
std::vector<std::future<void>> futures;

for(auto th = 0; th < numCPU; ++th) {
    auto begin = th * iterationsPerCore; // first index of this thread's chunk
    auto ftr = std::async( std::launch::async, // force a real thread; deferred would serialize
        [begin, iterationsPerCore, someArray]()
        {
            for(auto m = begin; m < begin + iterationsPerCore; ++m)
                Work(someArray, m);
        }
    );
    futures.push_back(std::move(ftr));
}

// wait after all chunks have been launched, not inside the launch loop
for(auto& ftr : futures)
    ftr.wait();

// rest of iterations: n%iterationsPerCore
for(auto r = numCPU * iterationsPerCore; r < n; ++r)
    Work(someArray, r);

The problem is that it runs only 50% faster on Intel CPUs, while on AMD it runs 300% faster. I ran it on three Intel CPUs (Nehalem 2 cores + HT, Sandy Bridge 2 cores + HT, Ivy Bridge 4 cores + HT). The AMD processor is a Phenom II x2 with 4 cores unlocked. On the 2-core Intel processors it runs 50% faster with 4 threads; on the 4-core one it also runs only 50% faster with 4 threads. I'm testing with VS2012 on Windows 7.

When I try it with 8 threads, it is 8x slower than the serial loop on Intel. I suppose this is caused by Hyper-Threading (HT).

What do you think about this? What is the reason for this behavior? Is my code incorrect?

I'd suspect false sharing. This is what happens when two variables share the same cache line: all operations on them have to be very expensively synchronized even if they are never accessed concurrently, because the cache can only operate in units of whole cache lines, even when your accesses are more fine-grained. I would suspect that the AMD hardware is simply more resilient to this, or has a different design that copes with it better.

To test this, change the code so that each core only works on chunks that are multiples of 64 bytes. That should avoid any cache-line sharing, since Intel CPUs use a 64-byte cache line.

I would say you need to change your compiler settings so the generated code minimizes the number of branches. The two CPU families have different branch-prediction and look-ahead designs, so you should set the compiler's optimization target to match the CPU the code will run on, not the CPU it is compiled on.

You should also be aware of the CPU cache. Here is a good article on this topic.

The short version: the hardware caches the data, but if all threads work on the same memory (someArray), the CPUs' caches have to synchronize constantly, which can even make the code run slower than the single-threaded version.
