Little performance gain when using multiple threads

Question

I was implementing a multithreaded Gauss-Jordan method for solving a linear system, and I saw that running on two threads took only about 15% less time than running on a single thread, instead of the ideal 50%. So I wrote a simple program that reproduces this. Here I create a 2000x2000 matrix and give 2000/THREADS_NUM rows to each thread to do some calculations with them.

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#ifndef THREADS_NUM
#define THREADS_NUM 1
#endif

#define MATRIX_SIZE 2000


typedef struct {
    double *a;
    int row_length;
    int rows_number;
} TWorkerParams;

void *worker_thread(void *params_v)
{
    TWorkerParams *params = (TWorkerParams *)params_v;
    int row_length = params->row_length;
    int i, j, k;
    int rows_number = params->rows_number;
    double *a = params->a;

    for(i = 0; i < row_length; ++i) // row_length is always the same
    {
        for(j = 0; j < rows_number; ++j) // rows_number is inversely proportional
                                         // to the number of threads
        {
            for(k = i; k < row_length; ++k) // row_length is always the same
            {
                a[j*row_length + k] -= 2.;
            }
        }
    }
    return NULL;
}


int main(int argc, char *argv[])
{
    // The matrix is of size NxN
    double *a =
        (double *)malloc(MATRIX_SIZE * MATRIX_SIZE * sizeof(double));
    TWorkerParams *params =
        (TWorkerParams *)malloc(THREADS_NUM * sizeof(TWorkerParams));
    pthread_t *workers = (pthread_t *)malloc(THREADS_NUM * sizeof(pthread_t));
    struct timespec start_time, end_time;
    int rows_per_worker = MATRIX_SIZE / THREADS_NUM;
    int i;
    if(!a || !params || !workers)
    {
        fprintf(stderr, "Error allocating memory\n");
        return 1;
    }
    for(i = 0; i < MATRIX_SIZE*MATRIX_SIZE; ++i)
        a[i] = 4. * i; // just an example matrix
    // Initialization of matrix is done, now initialize threads' params
    for(i = 0; i < THREADS_NUM; ++i)
    {
        params[i].a = a + i * rows_per_worker * MATRIX_SIZE;
        params[i].row_length = MATRIX_SIZE;
        params[i].rows_number = rows_per_worker;
    }
    // Get start time
    clock_gettime(CLOCK_MONOTONIC, &start_time);
    // Create threads
    for(i = 0; i < THREADS_NUM; ++i)
    {
        if(pthread_create(workers + i, NULL, worker_thread, params + i))
        {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }
    // Join threads
    for(i = 0; i < THREADS_NUM; ++i)
    {
        if(pthread_join(workers[i], NULL))
        {
            fprintf(stderr, "Error joining thread\n");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end_time);
    printf("Duration: %lf msec.\n", (end_time.tv_sec - start_time.tv_sec)*1e3 +
            (end_time.tv_nsec - start_time.tv_nsec)*1e-6);
    return 0;
}

Here is how I compile it:

gcc threads_test.c -o threads_test1 -lrt -pthread -DTHREADS_NUM=1 -Wall -Werror -Ofast
gcc threads_test.c -o threads_test2 -lrt -pthread -DTHREADS_NUM=2 -Wall -Werror -Ofast

Now, when I run it, I get:

./threads_test1
Duration: 3695.359552 msec.
./threads_test2
Duration: 3211.236612 msec.

This means the 2-thread program runs only 13% faster than the single-thread one, even though there is no synchronization between the threads and they don't share any memory. I found this answer: https://stackoverflow.com/a/14812411/5647501 and thought there might be some issue with the processor cache, so I added padding, but the result remained the same. I changed my code as follows:

typedef struct {
    double *a;
    int row_length;
    int rows_number;
    volatile char padding[64 - 2*sizeof(int)-sizeof(double)];
} TWorkerParams;

#define VAR_SIZE (sizeof(int)*5 + sizeof(double)*2)
#define MEM_SIZE ((VAR_SIZE / 64 + 1) * 64  )
void *worker_thread(void *params_v)
{
    TWorkerParams *params = (TWorkerParams *)params_v;
    volatile char memory[MEM_SIZE];
    int *row_length  =      (int *)(memory + 0);
    int *i           =      (int *)(memory + sizeof(int)*1);
    int *j           =      (int *)(memory + sizeof(int)*2);
    int *k           =      (int *)(memory + sizeof(int)*3);
    int *rows_number =      (int *)(memory + sizeof(int)*4);
    double **a        = (double **)(memory + sizeof(int)*5);

    *row_length = params->row_length;
    *rows_number = params->rows_number;
    *a = params->a;

    for(*i = 0; *i < *row_length; ++*i) // row_length is always the same
    {
        for(*j = 0; *j < *rows_number; ++*j) // rows_number is inversely proportional
                                         // to the number of threads
        {
            for(*k = 0; *k < *row_length; ++*k) // row_length is always the same
            {
                (*a + *j * *row_length)[*k] -= 2. * *k;
            }
        }
    }
    return NULL;
}

So my question is: why do I get only a 15% speed-up instead of 50% when using two threads here? Any help or suggestion will be appreciated. I am running 64-bit Ubuntu Linux, kernel 3.19.0-39-generic, on an Intel Core i5 4200M CPU (two physical cores with multithreading), but I also tested it on two other machines with the same result.

EDIT: If I replace a[j*row_length + k] -= 2.; with a[0] -= 2.; I get the expected speed-up:

./threads_test1
Duration: 1823.689481 msec.
./threads_test2
Duration: 949.745232 msec.

EDIT 2: Now, when I replace it with a[k] -= 2.; I get the following:

./threads_test1
Duration: 1039.666979 msec.
./threads_test2
Duration: 1323.460080 msec.

This one I can't understand at all.

Answer 1

This is a classic issue: switch the i and j for loops.

You are iterating through columns first and processing rows in the inner loop, which means you get many more cache misses than necessary.

My results with the original code (the first version, without padding):

$ ./matrix_test1
Duration: 4620.799763 msec.
$ ./matrix_test2
Duration: 2800.486895 msec.

(a better improvement than yours, actually)

After switching the for loops for i and j:

$ ./matrix_test1
Duration: 1450.037651 msec.
$ ./matrix_test2
Duration: 728.690853 msec.

Here is the 2x speedup.

EDIT: In fact, the original is not that bad, because the k index still walks along a row, iterating over columns, but it is still much better to iterate over the rows in the outer loop. And as i rises, you process fewer and fewer items in the innermost loop, so it still matters.

EDIT2: (removed the block solution because it was actually producing different results) but it should still be possible to use blocks to improve cache performance.

Answer 2

You mention a 13% speed-up, but how much of the elapsed time is spent in your calculation function, and how much in the rest of the program?

You could start by measuring only the time spent in the calculation function, without the thread management time. It's possible that you lose a significant part of your time in thread management. That could explain the small speed-up you obtained.

On the other hand, a 50% speed-up with 2 threads is nearly impossible to obtain.

Answer 1
7 2015-12-07 16:33:21

Answer 2
1 2015-12-07 15:54:40