Why is my parallel code slower than the serial version?

Question

Issue

Hello everyone, I have a program (found on the net) that I want to speed up by converting it to a parallel version using pthreads. Surprisingly, though, it runs slower than the serial version. Below is the program:

# include <stdio.h>

//fast square root algorithm
double asmSqrt(double x) 
{
  __asm__ ("fsqrt" : "+t" (x));
  return x;
}

//test if a number is prime
bool isPrime(int n)
{   
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n%2 == 0) return false;

    int sqrtn,i;
    sqrtn = asmSqrt(n);

    for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
    return true;
}

//number generator iterated from 0 to n
int main()
{
    int n = 1000000; //maximum number
    int k = 0, j;    //k counts the primes found so far

    for (j = 0; j<= n; j++)
    {
        if(isPrime(j) == 1) k++;
        if(j == n) printf("Count: %d\n",k);
    }
    return 0;
}

First attempt at parallelization

I let pthreads manage the for loop:

# include <stdio.h>
.
.

int main()
{
    .
    .
    //----->pthread code here<----
    for (j = 0; j<= n; j++)
    {
        if(isPrime(j) == 1) k++;
        if(j == n) printf("Count: %d\n",k);
    }
    return 0;
}

Well, it runs slower than the serial one.

Second attempt

I divided the for loop between two threads and ran them in parallel using pthreads.

However, it still runs slower. I expected it to run about twice as fast, or at least noticeably faster, but it doesn't!

Here is my parallel code, by the way:

# include <stdio.h>
# include <pthread.h>
# include <cmath>

# define NTHREADS 2

pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int k = 0;

double asmSqrt(double x) 
{
  __asm__ ("fsqrt" : "+t" (x));
  return x;
}

struct arg_struct
{
    int initialPrime;
    int nextPrime;
};

bool isPrime(int n)
{   
    if (n <= 1) return false;

    if (n == 2) return true;

    if (n%2 == 0) return false;

    int sqrtn,i;
    sqrtn = asmSqrt(n);

    for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;

    return true;
}

void *parallel_launcher(void *arguments)
{
    struct arg_struct *args = (struct arg_struct *)arguments;

    int j = args -> initialPrime;
    int n = args -> nextPrime - 1;

    for (j = 0; j<= n; j++)
    {
        if(isPrime(j) == 1)
        {
            printf("This is prime: %d\n",j);
            pthread_mutex_lock( &mutex1 );
            k++;
            pthread_mutex_unlock( &mutex1 );
        }

        if(j == n) printf("Count: %d\n",k);
    }
    pthread_exit(NULL);
}

int main()
{
    int f = 100000000;
    int m;

    pthread_t thread_id[NTHREADS];
    struct arg_struct args;

    int rem = (f+1)%NTHREADS;
    int n = floor((f+1)/NTHREADS);

    for(int h = 0; h < NTHREADS; h++)
    {
        if(rem > 0)
        {
            m = n + 1;
            rem-= 1;
        }
        else if(rem == 0)
        {
            m = n;
        }

        args.initialPrime = args.nextPrime;
        args.nextPrime = args.initialPrime + m;

        pthread_create(&thread_id[h], NULL, &parallel_launcher, (void *)&args);
        pthread_join(thread_id[h], NULL);
    }
   // printf("Count: %d\n",k);
    return 0;
}

Note: OS: Fedora 21 x86_64; Compiler: gcc-4.4; Processor: Intel Core i5 (2 physical cores, 4 logical); Memory: 6 GB; HDD: 340 GB.

Answer 1

You need to split the range you are examining for primes into n parts, where n is the number of threads.

The code that each thread runs becomes:

typedef struct start_end {
    int start;
    int end;
} start_end_t;

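// Thread entry point: counts the primes in [start, end] with a private counter.
// pthread start routines must take and return void *.
void *find_primes_in_range(void *in) {
    start_end_t *start_end = (start_end_t *) in;

    int num_primes = 0;
    for (int j = start_end->start; j <= start_end->end; j++) {
       if (isPrime(j))
           num_primes++;
    }
    // Hand the count back as the thread's exit value (intptr_t needs <stdint.h>).
    pthread_exit((void *)(intptr_t) num_primes);
}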

The main routine first starts all the threads, each of which calls find_primes_in_range, and only then calls pthread_join for each thread. It sums the values returned by find_primes_in_range. This avoids locking and unlocking a shared count variable.

This will parallelize the work, but the amount of work per thread will not be equal. This can be addressed, but it is more complicated.

Answer 2

The main design flaw: you must let each thread have its own private counter variable instead of using the shared one. Otherwise the threads will spend far more time waiting on and handling that mutex than they spend on the actual calculation. You are essentially forcing the threads to execute serially.

Instead, count with a private counter variable and, once a thread is done with its work, return the counter and sum the results in main().

Also, you should not call printf() from inside the threads. If there is a context switch in the middle of a printf call, you'll end up with garbled output such as This is This is prime: 2. You would then have to synchronize the printf calls between threads, which will slow the program down again. Also, the printf() calls themselves are likely 90% of the work that the thread is doing. So some re-design of who does the printing might be a good idea, depending on what you want to do with the results.

Answer 3

Summary

Indeed, using pthreads did speed up my code. My mistakes were placing pthread_join right after the first pthread_create, and the common counter I had set on the arguments. After fixing these, I tested my parallel code by determining the primality of 100 million numbers and compared its processing time with that of the serial code. Below are the results.

http://i.stack.imgur.com/gXFyk.jpg (I could not attach the image as I don't have enough reputation yet, so I am including a link instead.)

I ran three trials of each version to account for variations caused by other OS activity. Parallel programming with pthreads gave a speedup. What is surprising is that the pthreads code running in ONE thread was a bit faster than the purely serial code. I cannot explain this; nevertheless, pthreads is surely worth a try.

Here is the corrected parallel version of the code (gcc-c++):

# include <stdio.h>
# include <pthread.h>
# include <cmath>

# define NTHREADS 4

double asmSqrt(double x) 
{
  __asm__ ("fsqrt" : "+t" (x));
  return x;
}

struct start_end_f
{
    int start;
    int end;
};

//test if a number is prime
bool isPrime(int n)
{
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n%2 == 0) return false;

    int sqrtn = asmSqrt(n);
    for (int i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;

    return true;
}

//executes the tests for prime in a certain range, other threads will test the next range and so on..
void *find_primes_in_range(void *in) 
{
    int k = 0;

    struct start_end_f *start_end_h = (struct start_end_f *)in;

    for (int j = start_end_h->start; j < (start_end_h->end +1); j++) 
    {
        if(isPrime(j) == 1) k++;
    }

    int *t = new int;
    *t = k;
    pthread_exit(t);
}

int main() 
{
    int f = 100000000; //maximum number to be tested for prime

    pthread_t thread_id[NTHREADS];
    struct start_end_f start_end[NTHREADS];

    int rem = (f+1)%NTHREADS;
    int n = (f+1)/NTHREADS;
    int rem_change = rem;
    int m;

    if(rem>0) m = n+1;
    else if(rem == 0) m = n;

    //distributes task 'evenly' to the number of parallel threads requested
    for(int h = 0; h < NTHREADS; h++)
    {
        if(rem_change > 0)
        {
            start_end[h].start = m*h;
            start_end[h].end = start_end[h].start+m-1;
            rem_change -= 1;
        }
        else if(rem_change<= 0)
        {
            start_end[h].start = m*(h+rem_change)-rem_change*n;
            start_end[h].end = start_end[h].start+n-1;
            rem_change -= 1;
        }
        pthread_create(&thread_id[h], NULL, find_primes_in_range, &start_end[h]);
    }   

    //retrieving returned values
    int c = 0;
    for(int h = 0; h < NTHREADS; h++)
    {
        void *ret;
        pthread_join(thread_id[h], &ret);
        int *t = (int *)ret;
        c += *t;
        delete t; //release the counter allocated by the thread
    }

    printf("\nNumber of Primes: %d\n",c);
    return 0;
}
