
How can I make this parallel sum function use vector instructions?

As sort of a side project, I'm working on a multithreaded sum algorithm that should outperform std::accumulate when working on a large enough array. First I'm going to describe my thought process leading up to this, but if you want to skip straight to the problem, feel free to scroll down to that part.

I found many parallel sum algorithms online, most of which take the following approach:

#include <future>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>
using namespace std;

template <typename T, typename IT>
T parallel_sum(IT _begin, IT _end, T _init) {
    const auto size = distance(_begin, _end);
    static const auto n = thread::hardware_concurrency();
    // Small inputs or a single hardware thread: fall back to plain std::accumulate.
    if (size < 10000 || n == 1) return accumulate(_begin, _end, _init);
    vector<future<T>> partials;
    partials.reserve(n);
    auto chunkSize = size / n;
    // Each thread sums one contiguous chunk; the last chunk absorbs the remainder.
    for (unsigned i{ 0 }; i < n; i++) {
        partials.push_back(async(launch::async, [](IT _b, IT _e){
            return accumulate(_b, _e, T{0});
        }, next(_begin, i*chunkSize), (i==n-1)?_end:next(_begin, (i+1)*chunkSize)));
    }
    for (auto& f : partials) _init += f.get();
    return _init;
}

Assuming there are 2 threads available (as reported by thread::hardware_concurrency()), this function accesses the elements in memory in the following way:

[Figure: memory access pattern of parallel_sum]

As a simple example, we are looking at 8 elements here. The two threads are indicated by red and blue. The arrows show the locations from which the threads wish to load data. Once the cells turn either red or blue, they have been loaded by the corresponding thread.

This approach (at least in my opinion) is not the best, since the threads load data from different parts of memory simultaneously. If you have many processing threads, say 16 on an 8-core hyper-threaded CPU, or even more than that, the CPU's prefetcher would have a very hard time keeping up with all these reads from completely different parts of memory (assuming the array is far too big to fit in cache). This is why I think the second example should be faster:

// Uses the same headers as above.
template <typename T, typename IT>
T parallel_sum2(IT _begin, IT _end, T _init) {
    const auto size = distance(_begin, _end);
    static const auto n = thread::hardware_concurrency();
    if (size < 10000 || n == 1) return accumulate(_begin, _end, _init);
    vector<future<T>> partials;
    partials.reserve(n);
    // Thread i starts at element i and strides over the array in steps of n.
    for (unsigned i{ 0 }; i < n; i++) {
        partials.push_back(async(launch::async, [](IT _b, IT _e, unsigned _s){
            T _ret{ 0 };
            for (; _b < _e; advance(_b, _s)) _ret += *_b;
            return _ret;
        }, next(_begin, i), _end, n));
    }
    for (auto& f : partials) _init += f.get();
    return _init;
}

This function accesses memory in a sort-of-sequential way, like so:

[Figure: memory access pattern of parallel_sum2]

This way the prefetcher should always be able to stay ahead, since all the threads access roughly the same part of memory, so there should be fewer cache misses and faster load times overall, or at least that is my thinking.

The problem is that while this is all fine and dandy in theory, the actual compiled versions show a different result: the second one is way slower. I dug a little deeper into the problem and found that the assembly produced for the actual addition is very different. These are the "hot loops" in each one that perform the addition (remember that the first one uses std::accumulate internally, so you're basically looking at that):

[Figure: assembly of std::accumulate inside parallel_sum, and of the for loop in parallel_sum2]

Please ignore the percentages and the colors; my profiler sometimes gets things wrong.

I noticed that std::accumulate, when compiled, uses an AVX2 vector instruction, vpaddq. This can add four 64-bit integers at once. I think the reason the second version cannot be vectorized is that each thread only accesses one element at a time, then skips over some. The vector addition would load several contiguous elements and add them together. Clearly this cannot be done, since the threads don't load elements contiguously. I tried manually unrolling the for loop in the second version, and that vector instruction did appear in the assembly, but the whole thing became painfully slow for some reason.
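Roughly, the unrolled strided loop looked like this (a sketch only; the unroll factor of 4 and the helper name strided_sum_unrolled are illustrative, and it assumes random-access iterators):

#include <cstddef>
#include <iterator>

// Illustrative 4-way manual unrolling of the strided loop in parallel_sum2.
// _b is the thread's starting iterator, _e the end of the array, _s the stride
// (the number of threads). Indexing with _b[i] assumes random-access iterators.
template <typename T, typename IT>
T strided_sum_unrolled(IT _b, IT _e, unsigned _s) {
    T r0{0}, r1{0}, r2{0}, r3{0};
    const auto len = static_cast<std::size_t>(std::distance(_b, _e));
    std::size_t i = 0;
    // Process four strided elements per iteration, using separate accumulators.
    for (; i + 3 * _s < len; i += 4 * _s) {
        r0 += _b[i];
        r1 += _b[i + _s];
        r2 += _b[i + 2 * _s];
        r3 += _b[i + 3 * _s];
    }
    // Handle the remaining strided elements one at a time.
    for (; i < len; i += _s) r0 += _b[i];
    return r0 + r1 + r2 + r3;
}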

The above results and assembly code come from a gcc-compiled build, but the same kind of behavior can be observed with Visual Studio 2015 as well, although I haven't looked at the assembly it produces.

So is there a way to take advantage of vector instructions while retaining this sequential memory access model? Or does this memory access pattern even help at all compared to the first version of the function?

I wrote a little benchmark program, which is ready to compile and run, in case you want to see the performance for yourself.
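If the link ever goes stale, a minimal benchmark along these lines should reproduce the comparison (the element count, timing code, and output format are illustrative choices, and it assumes the two functions above are in scope):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Large enough that the data does not fit in cache.
    std::vector<std::int64_t> data(100'000'000, 1);

    auto time = [&](auto&& fn, const char* name) {
        const auto t0 = std::chrono::steady_clock::now();
        const auto result = fn();
        const auto t1 = std::chrono::steady_clock::now();
        std::cout << name << ": " << result << " in "
                  << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    };

    time([&] { return std::accumulate(data.begin(), data.end(), std::int64_t{0}); },
         "std::accumulate");
    time([&] { return parallel_sum(data.begin(), data.end(), std::int64_t{0}); },
         "parallel_sum");
    time([&] { return parallel_sum2(data.begin(), data.end(), std::int64_t{0}); },
         "parallel_sum2");
}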

PS: My primary target hardware is modern x86_64 (Haswell and the like).

Each core has its own cache and its own prefetcher.

You should look at each thread as an independently executing program. Then the shortcoming of the second approach becomes clear: a single thread does not access sequential data. There are holes that should not be processed, so the thread cannot use vector instructions.

Another problem: the CPU fetches data in chunks (cache lines). Due to how the different cache levels work, changing some data within a chunk marks that cache line stale, and if another core tries to operate on the same chunk of data it has to wait until the first core writes back its changes, then retrieve that chunk again. Basically, in your second example the cache is always stale and you see raw memory access performance.

The best way to handle concurrent processing is to process the data in large sequential chunks.
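One way to combine that advice with the interleaved idea from the question is to interleave at the level of large contiguous blocks rather than single elements, so each thread still runs a vectorizable std::accumulate over contiguous data. A sketch of that idea (the block size and the round-robin block assignment are illustrative assumptions, not a tuned implementation):

#include <algorithm>
#include <cstddef>
#include <future>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

template <typename T, typename IT>
T parallel_sum_blocks(IT begin, IT end, T init) {
    const auto size = static_cast<std::size_t>(std::distance(begin, end));
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t block = 8192;  // elements per block (illustrative choice)

    std::vector<std::future<T>> partials;
    partials.reserve(n);
    for (unsigned t = 0; t < n; ++t) {
        partials.push_back(std::async(std::launch::async, [=] {
            T sum{0};
            // Thread t takes blocks t, t + n, t + 2n, ...
            for (std::size_t b = t * block; b < size; b += n * block) {
                const auto first = std::next(begin, static_cast<std::ptrdiff_t>(b));
                const auto last  = std::next(begin,
                    static_cast<std::ptrdiff_t>(std::min(b + block, size)));
                // Contiguous inner range, so std::accumulate can vectorize.
                sum = std::accumulate(first, last, sum);
            }
            return sum;
        }));
    }
    for (auto& f : partials) init += f.get();
    return init;
}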
