Why does multi-threading (using pthread) seem slower than multi-process (using fork)?

Here I have tried to add up all the numbers between 0 and 1e9 using 3 methods:

  1. Normal sequential execution (single thread),
  2. Creating multiple processes (using fork), each adding up a smaller part, and summing all the parts at the end, and
  3. Creating multiple threads to do the same as the 2nd method.

As far as I know, thread creation is fast, which is why threads are called light-weight processes.

But on executing my code, I found that the 2nd method (multiple processes) was the fastest, followed by the 1st method (sequential) and then the 3rd (multi-threading). I am unable to figure out why this happens (maybe there is a mistake in my execution time calculation, or maybe something is different on my system, etc.).

Here is my C code:

#include "stdlib.h"
#include "stdio.h"
#include "unistd.h"
#include "string.h"
#include "time.h"
#include "sys/wait.h"
#include "sys/types.h"
#include "sys/sysinfo.h"
#include "pthread.h"
#define min(a,b) (a < b ? a : b)

int n = 1e9 + 24; // 2, 4, 8 multiple 

double show(clock_t s, clock_t e, int n, char *label){
    double t = (double)(e - s)/(double)(CLOCKS_PER_SEC);
    printf("=== N %d\tT %.6lf\tlabel\t%s === \n", n, t, label);
    return t;
}

void init(){
    clock_t start, end;
    long long int sum = 0;
    start = clock();
    for(int i=0; i<n; i++) sum += i;
    end = clock();
    show(start, end, n, "Single thread");
    printf("Sum %lld\n", sum); 
}

long long eachPart(int a, int b){
    long long s = 0;
    for(int i=a; i<b; i++) s += i;
    return s;
}
// multiple process with fork
void splitter(int a, int b, int fd[2], int n_cores){ // a,b are useless (ignore)
    clock_t s, e;
    s = clock();
    int ncores = n_cores;
    // printf("cores %d\n", ncores);
    int each = (b - a)/ncores, cc = 0;
    pid_t ff; 
    for(int i=0; i<n; i+=each){
        if((ff = fork()) == 0 ){
            long long sum = eachPart(i, min(i + each, n) );
            // printf("%d->%d, %d - %d - %lld\n", i, i+each, cc, getpid(), sum);
            write(fd[1], &sum, sizeof(sum));
            exit(0);
        }
        else if(ff > 0) cc++;
        else printf("fork error\n");
    }
    int j = 0;
    while(j < cc){
        int res = wait(NULL);
        // printf("finished r: %d\n", res);
        j++;
    }
    long long ans = 0, temp;
    while(cc--){
        read(fd[0], &temp, sizeof(temp));
        // printf("c : %d, t : %lld\n", cc, temp);
        ans += temp;
    }
    e = clock();
    show(s, e, n, "Multiple processess used");
    printf("Sum %lld\tcores used %d\n", ans, ncores);
}


// multi threading used 
typedef struct SS{
    int s, e;
} SS;

int tfd[2];

void* subTask(void *p){
    SS *t = (SS*)p;
    long long *s = (long long*)malloc(sizeof(long long)); 
    *s = 0;
    for(int i=t->s; i<t->e; i++){
        (*s) = (*s) + i;
    }
    write(tfd[1], s, sizeof(long long));
    return NULL;
}

void threadSplitter(int a, int b, int n_thread){ // a,b are useless (ignore)
    clock_t sc, e;
    sc = clock();
    int nthread = n_thread;
    pthread_t thread[nthread];
    int each = n/nthread, cc = 0, s = 0;
    for(int i=0; i<nthread; i++){
        if(i == nthread - 1){
            SS *t = (SS*)malloc(sizeof(SS));
            t->s = s, t->e = n; // start and end point
            if((pthread_create(&thread[i], NULL, &subTask, t))) printf("Thread failed\n");
            s = n; // update start point
        }
        else {
            SS *t = (SS*)malloc(sizeof(SS));
            t->s = s, t->e = s + each; // start and end point
            if((pthread_create(&thread[i], NULL, &subTask, t))) printf("Thread failed\n");
            s += each; // update start point
        }
    }
    long long ans = 0, tmp;
    // for(int i=0; i<nthread; i++){
    //     void *dd;
    //     pthread_join(thread[i], &dd); 
    //     // printf("i : %d s : %lld\n", i, *((long long*)dd));
    //     ans += *((long long*)dd);
    // }
    int cnt = 0;
    while(cnt < nthread){
        read(tfd[0], &tmp, sizeof(tmp));
        ans += tmp;
        cnt += 1;
    }
    e = clock();
    show(sc, e, n, "Multi Threading");
    printf("Sum %lld\tThreads used %d\n", ans, nthread);
}

int main(int argc, char* argv[]){
    init();

    printf("argc : %d\n", argc);
    
    // ncore - processes
    int fds[2];
    pipe(fds);
    int cores = get_nprocs();
    splitter(0, n, fds, cores);
    for(int i=1; i<argc; i++){
        cores = atoi(argv[i]);
        splitter(0, n, fds, cores);
    }
    
    // nthread - calc
    pipe(tfd); 
    threadSplitter(0, n, 16);
    for(int i=1; i<argc; i++){
        int threads = atoi(argv[i]);
        threadSplitter(0, n, threads);
    }

    return 0;
}

Output Results:

=== N 1000000024    T 2.115850  label   Single thread === 
Sum 500000023500000276
argc : 4
=== N 1000000024    T 0.000467  label   Multiple processess used === 
Sum 500000023500000276  cores used 8
=== N 1000000024    T 0.000167  label   Multiple processess used === 
Sum 500000023500000276  cores used 2
=== N 1000000024    T 0.000436  label   Multiple processess used === 
Sum 500000023500000276  cores used 4
=== N 1000000024    T 0.000755  label   Multiple processess used === 
Sum 500000023500000276  cores used 6
=== N 1000000024    T 2.677858  label   Multi Threading === 
Sum 500000023500000276  Threads used 16
=== N 1000000024    T 2.204447  label   Multi Threading === 
Sum 500000023500000276  Threads used 2
=== N 1000000024    T 2.235777  label   Multi Threading === 
Sum 500000023500000276  Threads used 4
=== N 1000000024    T 2.534276  label   Multi Threading === 
Sum 500000023500000276  Threads used 6

Also, I have used a pipe to transport the results of the sub-tasks. In the multi-threaded version I also tried joining the threads and merging the results sequentially, but the final result was similar: around 2 seconds of execution time.

TL;DR: you are measuring time in the wrong way. Use clock_gettime(CLOCK_REALTIME, ...) instead of clock().


You are measuring time using clock(), which, as stated on the manual page:

[...] returns an approximation of processor time used by the program. [...] The value returned is the CPU time used so far as a clock_t [...]

The system clock used by clock() measures CPU time, which is the time the calling process spends actually running on the CPU. The CPU time used by a process is the sum of the CPU time used by all of its threads, but not by its children, since those are different processes. See also: What specifically are wall-clock-time, user-cpu-time, and system-cpu-time in UNIX?

Therefore, the following happens in your 3 scenarios:

  1. No parallelism, sequential code. The CPU time spent running the process is pretty much all there is to measure, and it will be very similar to the actual wall-clock time spent. Note that the CPU time of a single-threaded program is always lower than or equal to its wall-clock time.

  2. Multiple child processes. Since you are creating child processes to do the actual work on behalf of the main (parent) process, the parent will use almost zero CPU time: the only thing it has to do is a few syscalls to create the children and then a few syscalls to wait for them to exit. Most of its time is spent sleeping while it waits for the children, not running on the CPU. The child processes are the ones that run on the CPU, but you are not measuring their time at all. Therefore you end up with a very short time (under 1 ms in your output). You are basically not measuring anything at all here.

  3. Multiple threads. Since you are creating N threads to do the work, and taking the CPU time in the main thread only, the CPU time of your process will amount to the sum of the CPU times of the threads. It should come as no surprise that, if you are doing the exact same calculation, the average CPU time spent by each thread is T/NTHREADS, and summing them up gives T/NTHREADS * NTHREADS = T. Indeed, you are using roughly the same CPU time as in the first scenario, only with a little bit of overhead for creating and managing the threads.

All of this can be solved in two ways:

  1. Carefully account for CPU time in the correct way in each thread/process and then sum or average the values as needed.
  2. Simply measure wall-clock time (i.e. real human time) instead of CPU time, using clock_gettime with CLOCK_REALTIME. Refer to the manual page for more info.
