Multithreaded C++ prime counter for a specified range

Question

#include <math.h>
#include <sstream>
#include <iostream>
#include <mutex>
#include <stdlib.h>
#include <chrono>
#include <thread>

bool isPrime(int number) {
    int i;

    for (i = 2; i < number; i++) {
        if (number % i == 0) {
            return false;
        }
    }

    return true;
}

std::mutex myMutex;

int pCnt = 0;

int icounter = 0;

int limit = 0;


int getNext() {
    std::lock_guard<std::mutex> guard(myMutex);
    icounter++;
    return icounter;
}

void primeCnt() {
    std::lock_guard<std::mutex> guard(myMutex);
    pCnt++;
}

void primes() {
    while (getNext() <= limit)
        if (isPrime(icounter))
            primeCnt();
}

int main(int argc, char *argv[]) {
    std::stringstream ss(argv[2]);
    int tCount;
    ss >> tCount;

    std::stringstream ss1(argv[4]);
    int lim;
    ss1 >> lim;

    limit = lim;

    auto t1 = std::chrono::high_resolution_clock::now();

    std::thread *arr;
    arr = new std::thread[tCount];

    for (int i = 0; i < tCount; i++)
        arr[i] = std::thread(primes);

    for (int i = 0; i < tCount; i++)
        arr[i].join();

    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << "Primes: " << pCnt << std::endl;
    std::cout << "Program took: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() <<
    " milliseconds" << std::endl;
    return 0;
}

Hello, I'm trying to count the prime numbers in a user-specified range, e.g. 1-1000000, using a user-specified number of threads to speed up the process. However, it seems to take the same amount of time with any number of threads as it does with one thread. I'm not sure whether it's supposed to be that way or whether there's a mistake in my code. Thank you in advance!

Answer 1

You don't see a performance gain because the time spent in isPrime() is much smaller than the time the threads spend contending for the mutex.

One possible solution is to use atomic operations, as @The Badger suggested. The other is to partition your task into smaller ones and distribute them over your thread pool.

For example, if you have n threads, then each thread should test the numbers from i*(limit/n) to (i+1)*(limit/n), where i is the thread number (counting from 0). This way you wouldn't need any synchronization at all, and your program would (theoretically) scale linearly.
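The partitioning idea above can be sketched as follows. This is only an illustration: countPrimesPartitioned is a made-up name, isPrime here stops at the square root and treats numbers below 2 as non-prime (unlike the version in the question), and the range boundaries are computed so that the remainder of limit/n is not lost.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Trial division up to sqrt(n); numbers below 2 are not prime.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int i = 2; static_cast<long long>(i) * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

// Hypothetical helper: count primes in [1, limit] with nThreads threads
// (nThreads >= 1 is assumed), each working on its own contiguous sub-range.
std::uint64_t countPrimesPartitioned(int limit, int nThreads) {
    std::vector<std::uint64_t> partial(nThreads, 0);  // one result slot per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        // 64-bit arithmetic avoids overflow in limit * t.
        int lo = static_cast<int>(static_cast<long long>(limit) * t / nThreads) + 1;
        int hi = static_cast<int>(static_cast<long long>(limit) * (t + 1) / nThreads);
        workers.emplace_back([lo, hi, t, &partial] {
            std::uint64_t local = 0;  // private tally, no locking needed
            for (int n = lo; n <= hi; ++n)
                if (isPrime(n)) ++local;
            partial[t] = local;  // a single write per thread
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (auto c : partial) total += c;
    return total;
}
```

Calling countPrimesPartitioned(1000000, 4) would then replace the shared counter loop. Note that with trial division the upper sub-ranges are more expensive than the lower ones, so the threads will probably not finish at the same time.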

Answer 2

Multithreaded algorithms work best when threads can do a lot of work on their own.

Imagine doing this in real life: you have a group of 20 people who will do work for you, and you want them to test whether each number up to 1000 is prime. How will you do this?

Would you hand each person a single number at a time, and ask them to come back to tell you whether it's prime and to receive another number?

Surely not; you would give each person a bunch of numbers to work on at once, and have them come back to tell you how many were prime and to receive another bunch of numbers.

Maybe you'd even divide the entire set of numbers into 20 groups and tell each person to work on one group. (But then you run the risk of one person being slow and everyone else sitting idle while you wait for that one person to finish. There are so-called "work stealing" algorithms for this, but that's complicated.)

The same thing applies here; you want each thread to do a lot of work on its own and keep its own tally, and only check back with the centralized information once in a while.

Answer 3

A better solution would be to use the Sieve of Atkin to find the primes (even the Sieve of Eratosthenes, which is easier to understand, is better); your basic algorithm is very poor to start with. For every number n in your interval it does up to n checks to determine whether n is prime, and it does this limit times. This means that you're doing up to about limit*limit/2 checks, which is what we call O(n^2) complexity. The Sieve of Atkin, on the other hand, only has to do O(n) operations to find all primes. If n is large, it is hard to beat the algorithm that has fewer steps by performing the steps faster. Trying to fix a poor algorithm by throwing more resources at it is a bad strategy.

Another problem with your implementation is that it has race conditions and is therefore broken to start with. There's often little use in optimizing something unless you first make sure it works correctly. The problem is in the primes function:
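To make the sieve suggestion concrete, here is a minimal single-threaded sketch of the Sieve of Eratosthenes (the simpler of the two sieves mentioned). The function name countPrimesSieve is hypothetical, and the sketch assumes the whole range fits in memory as one bit array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: count the primes in [2, limit] by sieving.
std::size_t countPrimesSieve(int limit) {
    if (limit < 2) return 0;
    // composite[k] becomes true once k is known to have a smaller prime factor.
    std::vector<bool> composite(static_cast<std::size_t>(limit) + 1, false);
    std::size_t count = 0;
    for (long long p = 2; p <= limit; ++p) {
        if (composite[static_cast<std::size_t>(p)]) continue;
        ++count;  // p survived all smaller primes, so it is prime
        // Cross off multiples of p, starting at p*p (smaller ones are already marked).
        for (long long m = p * p; m <= limit; m += p)
            composite[static_cast<std::size_t>(m)] = true;
    }
    return count;
}
```

For a limit of 1000000 this does far less work than testing each number by trial division, which is the point the answer is making about choosing the algorithm before adding threads.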

void primes() {
    while (getNext() <= limit)
        if( isPrime(icounter) )
            primeCnt();
}

Between the call to getNext() and the call to isPrime(), another thread may have incremented icounter, which causes the program to skip candidates. As a result the program can give a different result on each run. In addition, primes() reads icounter without holding the mutex, which is a data race in its own right. (Declaring icounter or pCnt volatile would not help here: the mutex already makes writes done under the lock visible to the next thread that takes it, and volatile does not synchronize threads.)
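One way to close this race, sketched under the assumption that the rest of the program stays as in the question, is to test the value that getNext() returned rather than re-reading the shared icounter. The countPrimes wrapper is a hypothetical helper added only so the sketch can be run on its own, and isPrime is a corrected variant that rejects numbers below 2.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex myMutex;
int icounter = 0;
int pCnt = 0;
int limit = 0;

// Trial division up to sqrt(n); numbers below 2 are not prime.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int i = 2; static_cast<long long>(i) * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

int getNext() {
    std::lock_guard<std::mutex> guard(myMutex);
    return ++icounter;  // the returned value belongs to the caller alone
}

void primeCnt() {
    std::lock_guard<std::mutex> guard(myMutex);
    ++pCnt;
}

void primes() {
    int candidate;  // thread-local copy of the claimed number
    while ((candidate = getNext()) <= limit)
        if (isPrime(candidate))  // never reads the shared icounter
            primeCnt();
}

// Hypothetical wrapper: reset the globals, run the workers, return the count.
int countPrimes(int lim, int threads) {
    icounter = 0;
    pCnt = 0;
    limit = lim;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) pool.emplace_back(primes);
    for (auto& t : pool) t.join();
    return pCnt;
}
```

This makes the result deterministic, but every candidate still costs a lock acquisition, so it does not by itself address the contention discussed in the other answers.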

Since the problem is CPU-intensive, that is, almost all of the time is spent executing CPU instructions, multithreading won't help unless you have multiple CPUs (or cores) on which the OS schedules the threads of the same process. This means there is a limit on the number of threads up to which you can expect improved performance (it can be as low as 1; I, for example, see an improvement only for two threads and none beyond that). If you have more threads than cores, the OS will just let one thread run for a while on a core and then switch and let the next thread execute for a while.
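As a rough illustration of capping the thread count at the number of cores, the standard library offers std::thread::hardware_concurrency(), which returns a hint and may return 0 when the value cannot be determined. The helper name pickThreadCount and its fallback policy are assumptions made for this sketch.

```cpp
#include <thread>

// Hypothetical helper: clamp a user-requested thread count to the number
// of hardware threads reported by the implementation.
unsigned pickThreadCount(unsigned requested) {
    unsigned cores = std::thread::hardware_concurrency();  // 0 means "unknown"
    if (cores == 0) return requested > 0 ? requested : 1;   // no hint: trust the user
    if (requested == 0 || requested > cores) return cores;  // more threads would only add switching
    return requested;
}
```

In the question's main, the value parsed from the command line could be passed through such a function before the threads are created.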

A further problem that may arise when threads are scheduled on different cores is that each core may have its own cache (which is faster than the shared cache). In effect, if two threads access the same memory, the separate caches have to be kept consistent as part of synchronizing the data involved, and this may be time-consuming.

That is, you have to strive to keep the data that the different threads work on separate and to minimize the use of shared, mutable data. In your example this means you should avoid the global data as much as possible. The counter, for example, only needs to be accessed when a thread has finished counting (to add that thread's contribution to the total). You could also minimize the use of icounter by not reading it for each candidate, but instead claiming a bunch of candidates in one go. Something like:

void primes() {
    int next;
    int count = 0;

    while ((next = getNext(1000)) <= limit) {
        for (int j = next; j < next + 1000 && j <= limit; j++) {
            if (isPrime(j))
                count++;
        }
    }

    primeCnt(count);
}

where getNext works as before, except that it reserves a number of candidates (by increasing icounter by the supplied count), and primeCnt adds count to pCnt.
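The two helpers described above are not spelled out in the answer, so the following is one possible reading rather than the author's code. It assumes getNext(count) returns the first number of the reserved batch, that icounter starts at 1, and that isPrime rejects numbers below 2; countPrimesBatched is a hypothetical wrapper for running the sketch.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex myMutex;
int icounter = 1;  // assumption: the next unclaimed candidate, starting at 1
int pCnt = 0;
int limit = 0;

bool isPrime(int n) {
    if (n < 2) return false;
    for (int i = 2; static_cast<long long>(i) * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

// Reserve `count` consecutive candidates and return the first of them.
int getNext(int count) {
    std::lock_guard<std::mutex> guard(myMutex);
    int first = icounter;
    icounter += count;  // assumes limit is far enough below INT_MAX not to overflow
    return first;
}

// Add one thread's private tally to the shared total.
void primeCnt(int count) {
    std::lock_guard<std::mutex> guard(myMutex);
    pCnt += count;
}

void primes() {
    int next;
    int count = 0;
    while ((next = getNext(1000)) <= limit) {
        for (int j = next; j < next + 1000 && j <= limit; j++)
            if (isPrime(j)) count++;
    }
    primeCnt(count);  // one lock per thread instead of one per prime
}

// Hypothetical wrapper: reset the globals, run the workers, return the count.
int countPrimesBatched(int lim, int threads) {
    icounter = 1;
    pCnt = 0;
    limit = lim;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) pool.emplace_back(primes);
    for (auto& t : pool) t.join();
    return pCnt;
}
```

With a batch size of 1000 the mutex is taken roughly once per thousand candidates, which is the reduction in shared-data traffic the answer argues for. The batch size itself is a tuning knob, not a recommended value.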

With more threads than cores, you may consequently end up in a situation where a core runs one thread, then after a while switches to another thread, and so on. The result is that you have to run all the code for your problem plus the code for switching between the threads. Add to that the likelihood of more cache misses, and this will probably even be slower.

Answer 4

Perhaps, instead of a mutex, try using atomic integers for the counters. It might speed things up a bit; I'm not sure by how much.

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> pCnt{0}; // uint64 for a bigger range, as @IgnisErus mentioned
std::atomic<std::uint64_t> icounter{0};

// Atomically claim the next candidate and return it.
std::uint64_t getNext() {
    return ++icounter;
}

void primeCnt() {
    ++pCnt;
}

When benchmarking, the processor often needs to warm up before it reaches its best performance, so a single timing is not always a good representation of the actual performance. Try running the code many times and taking the average. You can also try doing some heavy work before the calculation (a long for-loop computing powers of some counter?).

Getting accurate benchmark results is also a topic of interest to me, since I do not yet know how to do it.
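A small timing helper along the lines suggested above might look like this. It is a sketch, not a rigorous benchmarking method: averageMillis is a made-up name, the warm-up and run counts are arbitrary parameters, and runs is assumed to be greater than zero.

```cpp
#include <chrono>

// Hypothetical helper: call `work` a few times untimed to warm up, then
// time `runs` further calls and return the average in milliseconds.
template <typename F>
double averageMillis(F&& work, int warmups, int runs) {
    for (int i = 0; i < warmups; ++i) work();  // untimed warm-up calls
    using clock = std::chrono::steady_clock;   // monotonic, suited to measuring intervals
    auto t1 = clock::now();
    for (int i = 0; i < runs; ++i) work();
    auto t2 = clock::now();
    std::chrono::duration<double, std::milli> total = t2 - t1;
    return total.count() / runs;
}
```

The callable should do something whose result is actually used (for example, store the prime count in a variable that is printed afterwards), since otherwise an optimizing compiler may remove the work being measured.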

4 answers

Answer 1: score 2, accepted, 2015-10-21 06:13:19
Answer 2: score 2, 2015-10-21 06:16:41
Answer 3: score 0, 2015-10-21 06:20:37
Answer 4: score -1, 2015-10-21 06:00:50
