简体   繁体   中英

pthread running thread concurrently c

I need to make the Leibniz algorithm with pthread in c, now I have this code, but at the moment the threads implementation takes the same time of the sequential implementation, I think it is not running concurrently. Can someone see the error.

Thanks!!

#include<stdio.h>
#include<math.h>
#include<pthread.h>
#include<stdlib.h>
#define NUM_THREADS 2
#define ITERATIONS 100000000
double result = 0.0;
void *leibniz(void *threadid){
  int size = ITERATIONS/NUM_THREADS;
  int start = (long)threadid * size;
  int end = ((long)threadid+1) * size;
  int i;
  for(i = start; i<end; i++){
    int denom = 2*i+1;
    result += pow(-1.0, i) * (1.0/denom);
  }
}

int main(){
  pthread_t threads[NUM_THREADS];
  long t;
  int rc;

  // CREATE
  for(t=0;t<NUM_THREADS;t++){
    rc = pthread_create(&threads[t], NULL, leibniz, (void *)t);
    if(rc){
      printf("ERROR: return code %d\n", rc);
    }
  }
  // JOIN
  for(t=0;t<NUM_THREADS;t++){
    rc = pthread_join(threads[t], NULL);
    if(rc){
      printf("ERROR: return code %d\n", rc);
      exit(-1);
    }
  }
  printf("Pi %f\n", result*4);
  exit(0);

}

Thanks to Jean-François Fabre I did these changes, now it works!

double result=0.0;

void *leibniz(void *threadid){
  double local = 0.0;
  int size = ITERATIONS/NUM_THREADS;
  int start = (long)threadid * size;
  int end = ((long)threadid+1) * size;
  int i;
  for(i = start; i<end; i++){
    local += (i%2==0 ? 1 : -1) * (1.0/(2*i+1));
  }
  result += local*4;
}

I'll attempt to answer.

Even if your application is multithreaded, it's not guaranteed that there's 1 FPU per core. I know little about that but I think that some AMD processors actually share the FPU between cores.

Since your loop is basically adding and pow , it's 99% FPU computation, so if the FPU is shared on your computer, it explains the bottleneck.

You could reduce FPU usage by not calling pow just to compute -1 or 1 , which would be a scalar operation, and maybe would make a difference then. just use -1 if i is odd, 1 otherwise, or negate an external 1/-1 variable at each iteration.

Also, in order to avoid race conditions, accumulate the result in a local result, and add it in the end (protecting the addition by a mutex in the end would be even better)

double result = 0.0;
void *leibniz(void *threadid){
  double local = 0.0;
  int size = ITERATIONS/NUM_THREADS;
  int start = (long)threadid * size;
  int end = ((long)threadid+1) * size;
  int i;
  for(i = start; i<end; i++){
    int denom = 2*i+1;
    // using a ternary/scalar speeds up the "pow" computation, multithread or not
    local += (i%2 ? -1 : 1) * (1.0/denom);
  }
  // you may want to protect that addition with a pthread_mutex
  // start of critical section
  result += local;
  // end of critical section
}

http://wccftech.com/amd-one-fpu-per-core-design-zen-processors/

I'm running Visual Studio on Windows, and I haven't installed pthreads, so I created a test program using Windows threads. I split the calculation into one function that calculates all the positive terms and one function that calculates all the negative terms. Double precision is not an issue because the positive sum is < 22, and negative sum is > -19.

The processor is an Intel 3770K 3.5ghz (each core has it's own FPU). I tested calling the two functions in a row versus using a separate thread for the second function and the two thread case is twice as fast as the single thread case, single thread ~ 0.360 seconds, dual thread ~= 0.180 seconds.

#include <stdio.h>
#include <time.h>
#include <windows.h>

static HANDLE ht1;                      /* thread handle */

static DWORD WINAPI Thread0(LPVOID);    /* thread functions */
static DWORD WINAPI Thread1(LPVOID);

static clock_t ctTimeStart;             /* clock values */
static clock_t ctTimeStop;
static double  dTime;

static double pip;              /* sum of positive terms */
static double pim;              /* sum of negative terms */
static double pi;               /* pi */

int main()
{
    ctTimeStart = clock();
    Thread0(NULL);
    Thread1(NULL);
    ctTimeStop = clock();
    dTime = (double)(ctTimeStop - ctTimeStart) / (double)(CLOCKS_PER_SEC);  
    pip *= 4.;          /* pip <  22 after *= 4. */
    pim *= 4.;          /* pim > -19 after *= 4. */
    pi = pip + pim;
    printf("%.16lf %.16lf %.16lf %2.5lf secs\n", pi, pip, pim, dTime);

    ctTimeStart = clock();
    ht1 = CreateThread(NULL, 0, Thread1, 0, 0, 0);
    Thread0(NULL);
    WaitForSingleObject(ht1, INFINITE); // wait for thead 1
    ctTimeStop = clock();
    dTime = (double)(ctTimeStop - ctTimeStart) / (double)(CLOCKS_PER_SEC);  
    pip *= 4.;          /* pip <  22 after *= 4. */
    pim *= 4.;          /* pim > -19 after *= 4. */
    pi = pip + pim;
    printf("%.16lf %.16lf %.16lf %2.5lf secs\n", pi, pip, pim, dTime);

    CloseHandle(ht1);

    return 0;
}

DWORD WINAPI Thread0(LPVOID lpVoid)
{
double pp = 0.;                         /* local sum */
int j;
    for(j = 200000001; j >= 0; j -= 4)
        pp += 1. / (double)(j);
    pip = pp;                           /* store sum */
    return 0;
}

DWORD WINAPI Thread1(LPVOID lpVoid)
{
double pm = 0.;                         /* local sum */
int j;
    for(j = 200000003; j >= 0; j -= 4)
        pm -= 1. / (double)(j);
    pim = pm;                           /* store sum */
    return 0;
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM