
Why is my parallel foreach loop implementation slower than the single-threaded one?

I am trying to implement a parallel foreach loop for std::vector that runs the computation on an optimal number of threads (the number of cores minus one, leaving a core for the main thread). However, my implementation does not seem to be fast enough: it actually runs about 6 times slower than the single-threaded version!

Thread instantiation is often blamed as the bottleneck, so I tried a larger vector, but that did not seem to help.

I am currently stuck watching the parallel algorithm execute in 13,000-20,000 microseconds across separate threads while the single-threaded one executes in 120-200 microseconds on the main thread, and I cannot figure out what I am doing wrong. Of those 13-20 ms of the parallel run, 8 or 9 are usually spent creating the threads, but even so I can see no reason why a std::for_each covering 1/3 of the vector in a separate thread should take several times longer than another std::for_each needs to iterate through the whole vector.

#include <iostream>
#include <vector>
#include <thread>
#include <algorithm>
#include <chrono>
#include <functional> // for std::function
#include <cstdio>     // for getchar

const unsigned int numCores = std::thread::hardware_concurrency();

const size_t numUse = numCores - 1;

struct foreach
{
    inline static void go(std::function<void(uint32_t&)>&& func, std::vector<uint32_t>& cont)
    {
        std::vector<std::thread> vec;
        vec.reserve(numUse);
        // partition boundaries (note: computed here but never used below)
        std::vector<std::vector<uint32_t>::iterator> arr(numUse + 1);
        size_t distance = cont.size() / numUse;
        for (size_t i = 0; i < numUse; i++)
            arr[i] = cont.begin() + i * distance;
        arr[numUse] = cont.end();
        for (size_t i = 0; i < numUse - 1; i++)
        {
            vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
        }
        vec.emplace_back([&] { std::for_each(cont.begin() + (numUse - 1) * distance, cont.end(), func); });
        for (auto &d : vec)
        {
            d.join();
        }
    }
};


int main()
{
    std::chrono::steady_clock clock;
    std::vector<uint32_t> numbers;
    for (size_t i = 0; i < 50000000; i++)
        numbers.push_back(i);
    std::chrono::steady_clock::time_point t0m = clock.now();
    std::for_each(numbers.begin(), numbers.end(), [](uint32_t& value) { ++value; });

    std::chrono::steady_clock::time_point t1m = clock.now();
    std::cout << "Single-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1m - t0m).count() << "mcs\n";
    std::chrono::steady_clock::time_point t0s = clock.now();
    foreach::go([](uint32_t& i) { ++i; }, numbers);

    std::chrono::steady_clock::time_point t1s = clock.now();
    std::cout << "Multi-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1s - t0s).count() << "mcs\n";
    getchar();
}

Is there a way I can optimize this and increase the performance?

The compiler I am using is Visual Studio 2017's, with the Release x86 configuration. I have also been advised to use a profiler and am currently figuring out how to use one.

I actually did manage to get the parallel code to run faster than the regular one, but that required a vector of tens of thousands of vectors of five elements each. If anyone has advice on how to improve performance, or on where I can find a better implementation to study its structure, it would be appreciated.

Thank you for providing some example code.

Getting good metrics (especially on parallel code) can be pretty tricky. Your metrics are tainted.

  1. Use high_resolution_clock instead of steady_clock for profiling.
  2. Don't include the thread startup time in your timing measurement. Thread launch/join takes orders of magnitude longer than your actual work here. You should create the threads once and use condition variables to make them sleep until you signal them to work (a minimal sketch of this appears in the update below). This is not trivial, but it is essential that you don't measure the thread startup time.
  3. Visual Studio has a profiler. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.

Another issue kind of related to having good metrics is that your "work" is just incrementing an integer. Is that really representative of the work your program is going to be doing? Increment is really fast. If you look at the assembly generated by your sequential version, everything gets inlined into a really short loop.
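To see the effect, you could substitute a heavier per-element operation, for example a few rounds of xorshift-style integer mixing. This is purely a hypothetical stand-in workload (not anything from the original code); with more work per element, the thread overhead is easier to amortize:

// stand-in for "real" work: several rounds of integer mixing per element
foreach::go([](uint32_t& value) {
    for (int k = 0; k < 16; k++)
    {
        value ^= value << 13;
        value ^= value >> 17;
        value ^= value << 5;
    }
}, numbers);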

Lambdas have a very good chance of being inlined. But in your go function, you're converting the lambda to a std::function, and a std::function has a very poor chance of being inlined. So if you want to keep the chance of getting the lambda inlined, you have to use a template parameter instead:

template <typename FUNC>
inline static void go(FUNC&& func, std::vector<uint32_t>& cont)

By manually inlining your code (I moved the contents of the go function into main) and doing step 2 above, I was able to get the parallel version (4 threads on a hyperthreaded dual-core) to run in about 75% of the sequential time. That's not particularly good scaling, but it's not bad considering that the original was already pretty fast. For a further optimization, I would use SIMD aka "vector" operations (different from std::vector, except in the sense that they both relate to arrays), which apply the increment to multiple array elements in one iteration.
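For illustration, here is a sketch of what that might look like with SSE2 intrinsics (my example, not code from the original answer; SSE2 is available on any x86 target Visual Studio supports). _mm_add_epi32 increments four 32-bit integers at once:

#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>
#include <vector>

// Increment every element of the vector, four uint32_t lanes at a time.
void increment_simd(std::vector<uint32_t>& v)
{
    const __m128i ones = _mm_set1_epi32(1);
    size_t i = 0;
    // main loop: process 4 elements per iteration
    for (; i + 4 <= v.size(); i += 4)
    {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<__m128i*>(&v[i]));
        chunk = _mm_add_epi32(chunk, ones);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&v[i]), chunk);
    }
    // scalar tail for the remaining 0-3 elements
    for (; i < v.size(); i++)
        ++v[i];
}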

You have a race condition here:

for (size_t i = 0; i < numUse - 1; i++)
{
    vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
}

Because you set the default lambda capture to capture-by-reference, i is captured by reference, so by the time a thread actually runs, i may already have been incremented, which can cause some threads to process the wrong range or too long a range. You could write [&, i], but why risk shooting yourself in the foot again? Scott Meyers recommends against using default capture modes. Just write [&cont, &distance, &func, i].
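Applied to the loop above, that suggestion looks like this (only the capture list changes):

for (size_t i = 0; i < numUse - 1; i++)
{
    // capture i by value so each thread keeps the index it was created with
    vec.emplace_back([&cont, &distance, &func, i] {
        std::for_each(cont.begin() + i * distance,
                      cont.begin() + (i + 1) * distance, func);
    });
}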

UPDATE:

I think it's a fine idea to move your foreach into its own abstraction. What you should do is separate the thread creation from the task dispatch. That means you need some kind of signaling system (generally condition variables). You could look into thread pools, as sketched below.
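To make that concrete, here is a minimal sketch of such a pool (my illustration, not code from the original answer; the ThreadPool name and enqueue interface are invented for the example). The workers are created once and sleep on a condition variable until tasks are queued:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    explicit ThreadPool(size_t n)
    {
        for (size_t i = 0; i < n; i++)
            workers.emplace_back([this] {
                for (;;)
                {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        // sleep until there is work or we are shutting down
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty())
                            return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task(); // run outside the lock
                }
            });
    }

    void enqueue(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        cv.notify_all();
        for (auto& w : workers)
            w.join();
    }

private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stop = false;
};

With this structure, each call to your foreach only pays for queueing numUse tasks (you would still need a way to wait for their completion, e.g. a counter plus another condition variable), not for creating and joining numUse threads.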

An easy way to get a thread pool is to use OpenMP, which Visual Studio 2017 has support for (OpenMP 2.0). A caveat is that there's no guarantee the threads won't be created/destroyed on entry to/exit from a parallel section (that's implementation-dependent). So it trades performance for ease of use.
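For example, the loop from the question becomes (a sketch; note that OpenMP 2.0 requires a signed loop counter):

// Enable with /openmp (C/C++ -> Language -> Open MP Support) in Visual Studio.
#pragma omp parallel for
for (int i = 0; i < static_cast<int>(numbers.size()); i++)
    ++numbers[i];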

If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the standard algorithms do. https://en.cppreference.com/w/cpp/algorithm/for_each
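That would look like this (MSVC ships the parallel algorithms starting with Visual Studio 2017 15.7, compiled with /std:c++17):

#include <algorithm>
#include <execution> // std::execution::par
#include <cstdint>
#include <vector>

// the library handles partitioning and thread management for you
std::for_each(std::execution::par, numbers.begin(), numbers.end(),
              [](uint32_t& value) { ++value; });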

As for std::function: you can use it, you just don't want your innermost operation (the one that will be called 50,000,000 times) to be a std::function.

Bad:

void go(std::function<...>& func)
{
    // func (a std::function) is invoked once per element: 50,000,000 calls
    std::thread t([&] { std::for_each(v.begin(), v.end(), func); });
    ...
}

...
go([](int& i) { ++i; });

Good:

void go(std::function<...>& func)
{
    // func is invoked only once per thread; the hot loop lives inside it
    std::thread t(func);
    ...
}

...
go([&v]() { std::for_each(v.begin(), v.end(), [](int& i) { ++i; }); });

In the good version, the short inner lambda (i.e. ++i) gets inlined into the call to for_each. That's important because it gets called 50 million times. The call to the bigger lambda is not inlined (because it's converted to std::function), but that's OK because it only happens once per thread.
