Incomprehensible performance improvement with OpenMP, even with num_threads(1)

Question

The following lines of code

int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );

unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

take 11130 usecs to run on my i5-3230M when compiled with

g++ -o main main.cpp -std=c++0x -O3

that is, when the OpenMP pragmas are ignored.

On the other hand, when compiled with

g++ -o main main.cpp -std=c++0x -O3 -fopenmp

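the code runs more than 6 times faster, which is very surprising given that it runs on a 2-core machine. In fact, I also tested it with num_threads(1), and the performance improvement is still very significant (more than 3 times faster).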

Can anyone help me understand this behavior?

Edit: as suggested, here is the complete code:

#include <stdlib.h>
#include <iostream>

#include <chrono>
#include <cassert>


int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;


void func()
{
    unsigned char *pbuff = buff;
    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}


int main()
{
    // allocate and zero-initialize the buffer
    buff = (unsigned char *) malloc( numel );
    assert(buff != NULL);
    for(size_t k=0; k<numel; k++)
        buff[k] = 0;

    //
    std::chrono::high_resolution_clock::time_point begin;
    std::chrono::high_resolution_clock::time_point end;
    begin = std::chrono::high_resolution_clock::now();      
    //
    for(int k=0; k<100; k++)
        func();
    //
    end = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
    std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;

    return 0;
}

Answer 1

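It turns out that firstprivate(pbuff, nrows, ncols) effectively declares pbuff, nrows and ncols as local variables within the scope of the for loop. That in turn means the compiler can treat nrows and ncols as loop-invariant, effectively as constants. It cannot make the same assumption about global variables!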

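Consequently, compiling with -fopenmp gives a huge speedup because the loop no longer accesses a global variable on every iteration. (In addition, with a constant ncols value, the compiler can do some loop unrolling.)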
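
One plausible reason the compiler must keep re-reading the globals: a store through an unsigned char * is allowed to alias any object, including nrows and ncols themselves. The sketch below makes that visible by pointing such a loop at its own bound. The names g_bound, zero_bytes and demo_alias are hypothetical and exist only for this illustration; they are not part of the code in the question.

```cpp
#include <cassert>

// Hypothetical mutable global, standing in for ncols in the question.
int g_bound = 104;

// Zeroes bytes starting at p while j < g_bound; returns how many were written.
// Because p is an unsigned char*, each store may legally modify g_bound,
// so the compiler has to re-read g_bound on every iteration.
int zero_bytes(unsigned char* p) {
    assert(p != nullptr);
    int written = 0;
    for (int j = 0; j < g_bound; j++) {
        p[j] = 0;
        written++;
    }
    return written;
}

// Points the loop at the bytes of g_bound itself: the stores shrink the
// bound to 0 after at most sizeof(int) writes, so the loop ends early.
int demo_alias() {
    g_bound = 104;
    return zero_bytes(reinterpret_cast<unsigned char*>(&g_bound));
}
```

Calling demo_alias() returns far fewer than 104 writes, and g_bound ends up as 0. A local copy of the bound, or a firstprivate clause, removes exactly this dependency, which is what lets the optimizer hoist the bound out of the loop.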

By changing

int nrows = 4096;
int ncols = 4096;

to

const int nrows = 4096;
const int ncols = 4096;

or by changing

for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

to

int _nrows = nrows;
int _ncols = ncols;
for (int i=0; i<_nrows; i++)
{
    for (int j=0; j<_ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

the anomalous speedup disappears: the non-OpenMP code is now just as fast as the OpenMP code.

The moral of the story? Avoid accessing mutable global variables inside performance-critical loops.
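
One way to apply this advice is to pass the buffer and its dimensions as function parameters, so the hot loop never reads a global. This is only a minimal sketch, and the helper name increment_all is hypothetical. Unlike the code in the question, it derives each row's start from i instead of advancing one pointer across the whole buffer, on the assumption that the function should stay correct if more than one thread is ever used.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: adds 1 (mod 256) to every byte of a rows x cols buffer.
// All loop bounds are parameters, i.e. locals, rather than mutable globals.
void increment_all(unsigned char* buf, int rows, int cols) {
    assert(buf != nullptr);
    // The pragma is ignored when compiling without -fopenmp.
    #pragma omp parallel for schedule(static) firstprivate(buf, rows, cols)
    for (int i = 0; i < rows; i++) {
        // Row start computed from i, so each iteration is independent.
        unsigned char* row = buf + static_cast<std::size_t>(i) * cols;
        for (int j = 0; j < cols; j++) {
            row[j] += 1;
        }
    }
}
```

With the globals from the question, the call would be increment_all(buff, nrows, ncols). The globals are then read once at the call site rather than inside the loop.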

Solution 1
5, accepted, 2015-06-08 08:53:29
