Efficient creation of thread pool (C++)

What is the 'best' way to create a thread pool for more efficient calculation?

Suppose I have the following code to print out how many primes are in a given interval (for demonstration only, I know it's super slow):

#include <future>
#include <iostream>
#include <thread>
#include <cmath>

// Trial division over 6k +/- 1 candidates; assumes n >= 2 (the caller starts at 2).
bool is_prime(int n) {
  if (n == 2 || n == 3) {
    return true;
  }
  else if (n % 2 == 0 || n % 3 == 0) {
    return false;
  }
  for (int i = 5; i < std::sqrt(n) + 1; i = i + 6) {
    if (n % i == 0 || n % (i + 2) == 0) {
      return false;
    }
  }
  return true;
}

int primes_in_range(int a, int b) {
  int total = 0;
  for (int i = a; i <= b; i++) {
    total += is_prime(i);
  }
  return total;
}

int main() {

  int total = primes_in_range(2, 10000000);
  std::cout << total << std::endl;

}

If I want to make this run faster by splitting the interval into smaller chunks for threads, how would I do so?

Currently, I'm doing something like this:

  auto thread1 = std::async(std::launch::async, primes_in_range, 2, 2500000);
  auto thread2 = std::async(std::launch::async, primes_in_range, 2500001, 5000000);
  auto thread3 = std::async(std::launch::async, primes_in_range, 5000001, 7500000);
  auto thread4 = std::async(std::launch::async, primes_in_range, 7500001, 10000000);
  int total1 = thread1.get();
  int total2 = thread2.get();
  int total3 = thread3.get();
  int total4 = thread4.get();
  std::cout << total1 + total2 + total3 + total4 << std::endl;

But this doesn't seem very efficient, especially if I want, say, n threads.

What is a better way to do it? I'm fairly new to multithreading in general, so do tell me if I am doing something terribly wrong!

Consider how you would calculate the results for the intervals sequentially: you would use a loop. You can do the same with std::async and std::future (note that std::async does not return a thread).

#include <future>
#include <iostream>
#include <vector>

// primes_in_range as defined in the question.

auto get_future_chunk(int from, int to) {
    return std::async(std::launch::async, primes_in_range, from, to);
}

int main() {
    std::vector<decltype(get_future_chunk(0, 0))> futures;
    int from = 2;
    const int chunk_size = 5000000;
    const int max = 10000000;
    int to = from + chunk_size;
    // One future per full chunk...
    while (to <= max) {
        futures.push_back(get_future_chunk(from, to));
        from = to + 1;
        to += chunk_size;
    }
    // ...and one for the remaining partial chunk.
    futures.push_back(get_future_chunk(from, max));
    int total = 0;
    for (auto& f : futures) total += f.get();
    std::cout << total << "\n";
}
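(Note: this needs C++14 or later for the deduced return type of get_future_chunk, and with GCC or Clang on Linux you should compile and link with -pthread.)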

The only reason I wrote the helper function is that I was too lazy to look up the exact type of the future returned from std::async; with the function the type is deduced for me. For testing on godbolt I used only two chunks, because requesting lots of threads gave me a "resource unavailable" error. The code fixes the chunk_size and from that determines the number of futures to spawn. The reverse is of course possible as well: fix the number of chunks and then calculate the interval bounds, as sketched below.

Complete example
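For the reverse approach, here is a rough sketch (assuming one chunk per hardware thread via std::thread::hardware_concurrency(), and primes_in_range as defined in the question); compile with something like g++ -O2 -pthread:

#include <algorithm>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int lo = 2;
    const int hi = 10000000;
    // Assumption: one chunk per hardware thread; hardware_concurrency()
    // may return 0, so fall back to a single chunk.
    const int n_chunks = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    const int span = (hi - lo + 1 + n_chunks - 1) / n_chunks; // ceiling division

    std::vector<std::future<int>> futures;
    for (int c = 0; c < n_chunks; ++c) {
        const int from = lo + c * span;
        if (from > hi) break; // trailing chunks may be empty
        const int to = std::min(hi, from + span - 1);
        futures.push_back(std::async(std::launch::async, primes_in_range, from, to));
    }

    int total = 0;
    for (auto& f : futures) total += f.get();
    std::cout << total << "\n";
}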

An alternative solution is to simply use OpenMP, which is supported by all mainstream C++ compilers so far (e.g. GCC, Clang, ICC, and even MSVC, though with some quirks). You just need to compile your code with -fopenmp and add a #pragma omp parallel for reduction(+:total) directive; OpenMP will create a thread pool, run the loop in parallel, and perform the synchronization. It also recycles the thread pool, so it is created only once (unless you change the number of threads or the like). Here is an example:

int primes_in_range(int a, int b) {
  int total = 0;
  // The "for" clause is required: without it every thread runs the whole
  // loop and the reduced total is multiplied by the thread count.
  #pragma omp parallel for reduction(+:total)
  for (int i = a; i <= b; i++) {
    total += is_prime(i);
  }
  return total;
}

OpenMP also lets you tweak the scheduling, which is useful here since is_prime takes a variable amount of time, so load balancing can improve performance (a schedule(guided) clause should be faster, but it is better to simply test the available schedules). If you need to defer computation, you can use the OpenMP tasking primitives, which are also pretty simple to use (and they save you from reinventing the wheel yet another time by letting the OpenMP runtime do all the hard/boring/bug-prone work for you).
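For illustration, a sketch of the same loop with an explicit schedule clause (guided is just one choice; test the others as suggested above):

int primes_in_range(int a, int b) {
  int total = 0;
  // schedule(guided): chunks start large and shrink, so threads that
  // draw cheap iterations come back for more work.
  #pragma omp parallel for schedule(guided) reduction(+:total)
  for (int i = a; i <= b; i++) {
    total += is_prime(i);
  }
  return total;
}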
