Multi-threaded speedup only after making the array private

Question

I am trying to learn multi-threaded programming using OpenMP.

To begin with, I tested a nested loop with a large number of array accesses and then parallelized it. The code is attached below. Basically, I have a fairly large array tmp in the inner loop, and if I make it shared so that every thread can access and change it, my code actually slows down as the number of threads increases. I have written it so that every thread writes exactly the same values to tmp. When I make tmp private, I get a speedup proportional to the number of threads. The number of operations seems to me to be exactly the same in both cases. Why does it slow down when tmp is shared? Is it because different threads try to access the same address at the same time?

int main(){
    int k,m,n,dummy_cntr=5000,nthread=10,id;
    long num=10000000;
    double x[num],tmp[dummy_cntr];
    double tm,fact;
    clock_t st,fn;

    st=clock();
    omp_set_num_threads(nthread);
#pragma omp parallel private(tmp)
    {
        id = omp_get_thread_num();
        printf("Thread no. %d \n",id);
#pragma omp for
        for (k=0; k<num; k++){
            x[k]=k+1;
            for (m=0; m<dummy_cntr; m++){
                tmp[m] = m;
            }
        }
    }
    fn=clock();
    tm=(fn-st)/CLOCKS_PER_SEC;
}

PS: I am aware that using clock() here doesn't really give the correct time. In this case I have to divide it by the number of threads to get a result similar to the one reported by "time ./a.out".
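The timing caveat in the PS can be made concrete with two small helpers. This is only a sketch: the names cpu_seconds and wall_seconds are made up for illustration, and it assumes a POSIX-like system where clock() reports CPU time summed over all threads of the process, plus a C11 library that provides timespec_get.

```c
#include <time.h>

/* CPU time used by the whole process, in seconds. With several busy
   threads this advances faster than wall-clock time, which is why the
   question divides it by the thread count. */
double cpu_seconds(void)
{
    /* The cast avoids the integer division in (fn-st)/CLOCKS_PER_SEC. */
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Elapsed wall-clock time in seconds, via C11 timespec_get. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Taking the difference of wall_seconds() before and after the parallel region gives elapsed time directly, with no division by the number of threads. When the program is built with OpenMP, omp_get_wtime() serves the same purpose.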

Answer 1

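This is probably caused by cache contention: if part of the array is accessed by two or more threads, it is cached multiple times, one copy per core. When a core needs to access it and the data has been changed, that core has to fetch the latest version from another core's cache, which takes time.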
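The contention described above can be observed without any data race by giving each thread its own counter and varying only the memory layout. The sketch below is illustrative, not a benchmark: the function names, the struct, and the 64-byte cache-line size are assumptions, and the pragmas are simply ignored when the file is compiled without OpenMP.

```c
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes; hardware dependent */

/* Counter padded so that consecutive array elements are CACHE_LINE bytes apart. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* Counters packed side by side: several of them share one cache line,
   so every increment can invalidate the line in the other cores. */
long count_packed(int nthreads, long iters)
{
    volatile long *c = calloc((size_t)nthreads, sizeof *c);
    long total = 0, i;
    int t;
    if (c == NULL) return -1;
    #pragma omp parallel for private(i) num_threads(nthreads)
    for (t = 0; t < nthreads; t++)
        for (i = 0; i < iters; i++)
            c[t]++;
    for (t = 0; t < nthreads; t++)
        total += c[t];
    free((void *)c);
    return total;
}

/* Same work, but each counter sits in its own cache line. */
long count_padded(int nthreads, long iters)
{
    struct padded_counter *c = calloc((size_t)nthreads, sizeof *c);
    long total = 0, i;
    int t;
    if (c == NULL) return -1;
    #pragma omp parallel for private(i) num_threads(nthreads)
    for (t = 0; t < nthreads; t++)
        for (i = 0; i < iters; i++)
            c[t].value++;
    for (t = 0; t < nthreads; t++)
        total += c[t].value;
    free(c);
    return total;
}
```

Both functions return nthreads * iters. Timing them with the same arguments on a multi-core machine would be expected to show the packed version running slower, although the size of the gap depends on the hardware.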

Answer 2

Your code has race conditions in tmp and m. I don't know what you are really trying to do, but this link might be helpful: "Fill histograms (array reduction) in parallel with OpenMP without using a critical section".

I tried cleaning up your code. This code allocates memory for tmp separately in each thread, which solves your problem with false sharing in tmp.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    int k,m,dummy_cntr=5000;
    long num=10000000;
    double *x, *tmp;
    double dtime;

    x = (double*)malloc(sizeof(double)*num);

    dtime = omp_get_wtime();
    #pragma omp parallel private(tmp, k, m)
    {
        tmp = (double*)malloc(sizeof(double)*dummy_cntr);
        #pragma omp for
        for (k=0; k<num; k++){
            x[k]=k+1;
            for (m=0; m<dummy_cntr; m++){
                tmp[m] = m;
            }
        }
        free(tmp);
    }
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);
    free(x);
    return 0;
}

Compiled with

gcc -fopenmp -O3 -std=c89 -Wall -pedantic foo.c
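An alternative to listing variables in a private clause is to declare them inside the parallel region, where they have block scope and are therefore private to each thread automatically. The sketch below assumes a hypothetical function fill and a compile-time SCRATCH_LEN matching dummy_cntr from the question; it is one possible arrangement, not the answer's original code.

```c
#define SCRATCH_LEN 5000  /* same size as dummy_cntr in the question */

/* Fills x[0..num-1] with 1, 2, ..., num while each thread writes to its
   own scratch buffer. Compiles and runs serially if OpenMP is disabled. */
void fill(double *x, long num)
{
    #pragma omp parallel
    {
        /* Declared inside the region: one copy per thread, no clause needed. */
        double tmp[SCRATCH_LEN];
        long k;
        int m;

        #pragma omp for
        for (k = 0; k < num; k++) {
            x[k] = (double)(k + 1);
            /* tmp is never read, so an optimizing compiler may drop this loop. */
            for (m = 0; m < SCRATCH_LEN; m++)
                tmp[m] = m;
        }
        (void)tmp;
    }
}
```

Because nothing declared outside the region is written by more than one thread except disjoint elements of x, there is no shared scratch state left to race on. The buffer lives on each thread's stack, so a much larger SCRATCH_LEN would call for per-thread heap allocation as in the code above.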

2 answers

Answer 1: score 5, accepted, 2013-06-19 15:25:31

Answer 2: score 1
