Why does pthread slow down the code?

I am new to pthreads and I wrote this code for testing. I don't understand why, when I run the code with only 1 pthread, it completes faster than when I run it with multiple pthreads. The code is the setup part of a genetic algorithm for solving a TSP. I have 3 linear arrays (city_x, city_y, city_id) that hold the data:

  • 1 for the x
  • 1 for the y
  • 1 for the id of each city

These arrays are linearized and represent the elements of the population. Each element has NUM_CITIES entries for x, y, and id. So if we have:

  • 3 elements in the population
  • NUM_CITIES = 10 cities for each element
  • then the total number of entries in each array is 3 * 10 = 30

The code takes the number of elements in the population as input, sets some coordinates in the city_set arrays, and fills the global arrays with the x, y coordinates and the id of every element of the entire population.
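Conceptually, the flattened indexing works like this (a small illustrative sketch, not part of the original program; flat_index, elem, and city are hypothetical names):

// Index of the city-th city of the elem-th population element in the flattened arrays:
// each element occupies a contiguous block of NUM_CITIES entries.
inline long flat_index(long elem, int city) {
    return elem * NUM_CITIES + city;   // e.g. element 2, city 3 -> 2 * 10 + 3 = 23
}

The full test program follows.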

#include <pthread.h>

#include <limits> // std::numeric_limits<double>
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <utility>
//#include <math.h>
#include <algorithm>    // std::lower_bound, std::find
#include <random>
#include <cmath> 
#include <cstring>
#include <iomanip>      // std::setprecision
#include <vector>       // std::vector

#define NUM_CITIES 10  // This is a tour for the LIN105. It has length 14379.
// #define SIZE_POP 100000000
#define SIZE_MATING 3
#define MUTATION_RATE 0.03
#define STALL_LIMIT 10

// shared variables
long size_pop = 0;
long tot_elem = 0;
const int num_threads = 24;
int tid[num_threads];
int start[num_threads];
int stop[num_threads];

// cities
int city_set_x[NUM_CITIES];
int city_set_y[NUM_CITIES];
int city_set_id[NUM_CITIES];

// population elements
int *city_x;
int *city_y;
int *city_id;

void *setup(void *p) {

    int id = *(int *)p;
    // std::cout << "id: " << id << "\n";

    int s = start[id];

    int perm[NUM_CITIES];
    for(int i = 0; i < NUM_CITIES; ++i) {
        perm[i] = i;
        // std::cout << perm[i] << ",";
    }

    for(long i = start[id]; i < stop[id]; i += NUM_CITIES) {
        std::random_shuffle ( perm, perm + NUM_CITIES );

        for(int j = 0; j < NUM_CITIES; ++j) {
            city_id[i + j] =  perm[j];
            city_x[i + j] =  city_set_x[perm[j]];
            city_y[i + j] =  city_set_y[perm[j]];
            // std::cout << "(" << city_x[i + j] << "," << city_y[i + j] << ") ";
        }
        // std::cout << "\n";
    }

    return NULL;
}


static inline const double diffmsec(const struct timeval & a, 
                                    const struct timeval & b) {
    long sec  = (a.tv_sec  - b.tv_sec);
    long usec = (a.tv_usec - b.tv_usec);

    if(usec < 0) {
        --sec;
        usec += 1000000;
    }
    return ((double)(sec*1000)+ (double)usec/1000.0);
}

int main(int argc, char *argv[]) {

    size_pop = atol(argv[1]);

    std::cout << size_pop << "\n";

    tot_elem = NUM_CITIES * size_pop;
    std::cout << "tot_elem: " << tot_elem << "\n";

    struct timeval program_start, program_end, setup_start, setup_end;

    std::vector<double> v_set;

    city_x = (int *)malloc(tot_elem * sizeof(int));
    // memset(city_x, -1, tot_elem * sizeof(int));
    city_y = (int *)malloc(tot_elem * sizeof(int));
    // memset(city_y, -1, tot_elem * sizeof(int));
    city_id = (int *)malloc(tot_elem * sizeof(int));
    for(int i = 0; i < tot_elem; ++i) {
        city_x[i] = -1;
        city_y[i] = -1;
        city_id[i] = -1;
    }

    srand(time(NULL));

    int x[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int y[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};


    // print
    std::cout << "[CITTA.X]\n";
    for(int i = 0; i < NUM_CITIES; ++i) {

        city_set_x[i] = x[i];
        // city_set[i].x = i + 1;
        std::cout << city_set_x[i] << " ";
    }
    std::cout << "\n";

    std::cout << "[CITTA.Y]\n";
    for(int i = 0; i < NUM_CITIES; ++i) {

        city_set_y[i] = y[i];
        // city_set[i].y = i + 1;
        std::cout << city_set_y[i] << " ";
    }
    std::cout << "\n";

    std::cout << "[CITTA.ID]\n";
    for(int i = 0; i < NUM_CITIES; ++i) {

        city_set_id[i] = i;
        std::cout << city_set_id[i] << " ";
    }
    std::cout << "\n";

    // std::cin.get() != '\n';

    pthread_t threads[num_threads];

    for(int i = 0; i < num_threads; ++i) {
        tid[i] = i;
        start[i] = i * NUM_CITIES * floor(size_pop/num_threads);
        // std::cout << "start: " << start << "\n";
        if(i != num_threads - 1) {
            stop[i] = start[i] + (floor(size_pop/num_threads) * NUM_CITIES);
            // std::cout << "stop: " << stop << "\n";
        }
        else {
            stop[i] = tot_elem;
            // std::cout << "stop: " << stop << "\n";
        }
        // std::cout << "\n";
    }

    for(int c = 0; c < 10; c++) {

        gettimeofday(&setup_start, NULL);

        for(int i = 0; i < num_threads; ++i) {
            if( pthread_create( &threads[i], NULL, &setup, (void *) &tid[i]) )
            {
              printf("Thread creation failed\n");
            }
        }

        for(int i = 0; i < num_threads; ++i) {
            pthread_join( threads[i], NULL);
        }

        gettimeofday(&setup_end, NULL);
        v_set.push_back(diffmsec(setup_end, setup_start) / 1000);
    }

    // // print
    // std::cout << "[SETUP]\n";
    // for(int i = 0; i < size_pop; ++i){
    //  long idx = i * NUM_CITIES;
    //  std::cout << "pop[" << i << "]: ";
    //  for(int j = 0; j < NUM_CITIES; ++j){
    //      std::cout << "(" << city_x[idx + j] << "," << city_y[idx + j] << ") ";
    //  }
    //  std::cout << "\n";
    // }

    double sum = 0;
    double mean;


    sum = 0;
    for (int i = 0; i < v_set.size(); ++i) {
        sum += v_set[i];
    }
    mean = sum / v_set.size();
    std::cout << "[SET]: " << mean << " s\n";

    free(city_x);
    free(city_y);
    free(city_id);

}

I ran the code with 1000000 elements and the number of threads set to 1, and the result is 0.332 s. Running with 1000000 elements but with the number of threads set to 4, the result is 1.361 s. If I increase the number of threads to 24, the result is 0.60 s, but that is still twice the sequential time! When I go beyond 24 threads, the result stays the same or increases again.

EDIT

Using: grep -c processor /proc/cpuinfo

I obtain 56.

Using: cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
stepping        : 1
microcode       : 0xb00001e
cpu MHz         : 1967.906
cache size      : 35840 KB
physical id     : 0
siblings        : 28
core id         : 0
cpu cores       : 14
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
bogomips        : 4799.62
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual

for each of the 56 processors.

std::random_shuffle uses a shared resource, and all the threads use it, so your program has high contention: the threads spend most of their time waiting for each other. Use a separate random generator for each thread (for example, std::mt19937 with std::shuffle; check out cppreference).
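A minimal sketch of that fix for the setup function above, assuming each thread keeps its own std::mt19937 seeded from std::random_device (the name local_rng is just illustrative):

#include <algorithm>   // std::shuffle
#include <random>      // std::mt19937, std::random_device

void *setup(void *p) {
    int id = *(int *)p;

    // One generator per thread: no shared hidden state, so no contention.
    std::random_device rd;
    std::mt19937 local_rng(rd() ^ (unsigned)id);

    int perm[NUM_CITIES];
    for (int i = 0; i < NUM_CITIES; ++i)
        perm[i] = i;

    for (long i = start[id]; i < stop[id]; i += NUM_CITIES) {
        // std::shuffle with an explicit generator replaces std::random_shuffle.
        std::shuffle(perm, perm + NUM_CITIES, local_rng);

        for (int j = 0; j < NUM_CITIES; ++j) {
            city_id[i + j] = perm[j];
            city_x[i + j]  = city_set_x[perm[j]];
            city_y[i + j]  = city_set_y[perm[j]];
        }
    }
    return NULL;
}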

Furthermore, you may want to increase NUM_CITIES, so each thread uses separate cache lines.
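To illustrate that point (this part is not from the original answer): with NUM_CITIES = 10, one element is 10 ints = 40 bytes, so the boundary between two threads' chunks can fall inside a single 64-byte cache line, and the two neighbouring threads then keep invalidating that line for each other (false sharing). Besides enlarging NUM_CITIES, a possible alternative is to round each thread's start index up to a cache-line multiple, roughly like this sketch (assuming 64-byte lines and 4-byte ints):

// Illustrative sketch only: align a start index (measured in ints) to a cache line.
const long INTS_PER_CACHE_LINE = 64 / sizeof(int);   // 16 ints per 64-byte line

long align_up_to_cache_line(long index) {
    return ((index + INTS_PER_CACHE_LINE - 1) / INTS_PER_CACHE_LINE)
           * INTS_PER_CACHE_LINE;
}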

Running the code with several threads requires the system to context-switch between the threads, which is computational overhead without any actual benefit. You also need a loop to compute the thread parameters, which gets more expensive as more threads are created, but this is probably the smallest of the delays introduced, since it doesn't require much computation.

Also notice that the threads may all be running on a single physical core; check how your resources are being used while the program runs. If the program only runs on a single core, you are not actually taking advantage of the hardware parallelism that multiple cores provide.

Finally, since this is C++, I suggest using the native std::thread.
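For example, a minimal sketch of the launch/join part with std::thread, assuming the setup work is refactored into a function that takes the thread id by value (setup_work is a hypothetical name):

#include <thread>
#include <vector>

void setup_work(int id);   // same body as setup(), but takes the id directly

void run_setup_threads() {
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back(setup_work, i);   // no void* casts, no tid[] array needed
    for (auto &t : threads)
        t.join();
}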

In the end, I think this slowdown results mostly from the context switching between threads and from the fact that the threads are probably running on a single core. Try to make sure the program runs on multiple physical cores and check how much time it takes then.
