Parallelizing a for loop gives no performance gain

Question

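I have an algorithm that converts a Bayer image channel to RGB. In my implementation I have a single nested for loop that iterates over the Bayer channel, calculates the RGB index from the Bayer index, and then sets that pixel's value from the Bayer channel. The main thing to notice here is that each pixel can be calculated independently of the other pixels (it does not rely on previous calculations), so the algorithm is a natural candidate for parallelization. The calculation does, however, rely on some preset arrays that all threads access at the same time but never modify.

However, when I tried parallelizing the main for with MS's concurrency::parallel_for, I gained no boost in performance. In fact, for an input of size 3264X2540 running on a 4-core CPU, the non-parallelized version ran in about 34 ms and the parallelized version ran in about 69 ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).

Using Intel's compiler with tbb::parallel_for gave nearly identical results. For comparison, I started out with this algorithm implemented in C#, where I also used parallel_for loops, and there I saw close to a 4x performance gain (I opted for C++ because, for this particular task, C++ was faster even on a single core).

Any ideas what is preventing my code from parallelizing well?

My code: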

template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
        //Translates index offset in Bayer image to channel offset in RGB image
        int offsets[4];
        //calculate offsets according to color space
        switch (ColorSpace)
        {
        case ColorSpace::BGGR:
            offsets[0] = 2;
            offsets[1] = 1;
            offsets[2] = 1;
            offsets[3] = 0;
            break;
        ...other color spaces
        }
        memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        parallel_for(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row%2)*2 + (col%2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
}

Answer 1

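First of all, your algorithm is memory bandwidth bound. That is, the memory loads and stores outweigh any index calculations you do.

Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.

Increasing the amount of work per iteration is also useless - both PPL and TBB are smart enough not to create a thread per iteration. They use a sensible partitioning, which additionally tries to preserve locality. For instance, here is a quote from the TBB::parallel_for documentation:

"When worker threads are available, parallel_for executes iterations in non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values."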

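What really matters is reducing memory operations. Any superfluous traversal of the input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.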
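
One way to drop the separate memset pass while still ending up with zeroed unused channels is to write all three channels of each pixel inside the conversion loop. The following is only a sketch under assumptions: ConvertRowWithZeroFill is a hypothetical helper that is not part of the original code, and it assumes the same offsets table as in the question. It still writes the whole output buffer; it only avoids traversing it twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): converts one Bayer row
// and zero-fills the two unused channels of every pixel in the same pass,
// so no separate memset over the output buffer is needed.
template <typename T>
void ConvertRowWithZeroFill(const T* bayer, T* rgb, int width, int row,
                            const int (&offsets)[4])
{
    assert(width > 0 && row >= 0);
    const std::size_t base = static_cast<std::size_t>(row) * width;
    for (int col = 0; col < width; ++col)
    {
        const std::size_t bayerIndex = base + col;
        T* pixel = rgb + bayerIndex * 3;
        const int channel = offsets[(row % 2) * 2 + (col % 2)];  // 0...3 -> channel
        // Write all three channels: zero everywhere except the sampled one.
        pixel[0] = T(0);
        pixel[1] = T(0);
        pixel[2] = T(0);
        pixel[channel] = bayer[bayerIndex];
    }
}
```

The body of the parallel loop would then call this helper once per row in place of the original inner loop.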

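You are fully traversing the input and output data. Even if you skip some elements in the output, that does not matter, because on modern hardware memory operations happen in 64-byte chunks. So calculate the size of your input and output, measure the time of the algorithm, divide size / time, and compare the result with the maximum throughput of your system (for instance, as measured with a memory benchmark).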
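
The size / time estimate described above can be sketched as follows. EffectiveBandwidthGBps is a hypothetical helper, not part of the original code; it assumes the whole Bayer buffer is read once and every cache line of the RGB buffer is touched once, and it ignores any extra traffic the hardware generates for stores.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rough effective memory throughput in GB/s for one
// conversion pass over a width x height image.
inline double EffectiveBandwidthGBps(std::size_t width, std::size_t height,
                                     std::size_t bytesPerSample, double seconds)
{
    assert(seconds > 0.0);
    const double pixels = static_cast<double>(width) * static_cast<double>(height);
    const double inputBytes = pixels * bytesPerSample;       // Bayer buffer, read once
    const double outputBytes = pixels * 3 * bytesPerSample;  // RGB buffer, every line touched
    return (inputBytes + outputBytes) / seconds / 1e9;
}
```

For the float benchmark in this answer (3264 x 20320 samples, 0.09 s without memset) this comes out at roughly 11.8 GB/s, which is the number to compare against what a memory benchmark reports for the machine.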

I have run a test with Microsoft PPL, OpenMP, Intel TBB, and a native for loop. The results are (I used 8x your height):

Native_For       0.21 s
OpenMP_For       0.15 s
Intel_TBB_For    0.15 s
MS_PPL_For       0.15 s

With the memset removed:

Native_For       0.15 s
OpenMP_For       0.09 s
Intel_TBB_For    0.09 s
MS_PPL_For       0.09 s

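As you can see, memset (which is highly optimized) is responsible for a significant share of the execution time, which shows how memory bound your algorithm is.

Full source code: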

#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>

using namespace boost;
using namespace std;

const auto Width = 3264;
const auto Height = 2540*8;

struct MS_PPL_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        concurrency::parallel_for(first,last,f);
    }
};

struct Intel_TBB_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        tbb::parallel_for(first,last,f);
    }
};

struct Native_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        for(; first!=last; ++first) f(first);
    }
};

struct OpenMP_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        #pragma omp parallel for
        for(auto i=first; i<last; ++i) f(i);
    }
};

template<typename T>
struct ConvertBayerToRgbImageAsIs
{
    const T* BayerChannel;
    T* RgbChannel;
    template<typename For>
    void operator()(For for_)
    {
        cout << type_name<For>() << "\t";
        progress_timer t;
        int offsets[] = {2,1,1,0};
        //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        for_(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row % 2)*2 + (col % 2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
    }
};

int main()
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
    for(auto i=0;i!=4;++i)
    {
        mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
        cout << string(16,'_') << endl;
    }
}

Answer 2

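Synchronization overhead

I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and run the computation in parallel, you would have noticed a large gain. Try to design the loop so that it has fewer iterations and more work per iteration. The reasoning behind this is that too much synchronization is being done.

Cache usage

An important factor may be how the data is split (partitioned) for processing. If the processed rows are interleaved as in the bad case below, more rows will cause a cache miss. This effect becomes more important with every additional thread, because the distance between the rows handled by one thread grows. If you are certain that the parallelizing function performs reasonable partitioning, then manual work splitting will not gain you anything.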

 bad       good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2

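Also make sure that you access your data in the same way that it is aligned; it is possible that each access to offsets[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive. Almost every operation either reads a memory segment or writes to one. Preventing cache misses and minimizing memory accesses is crucial.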
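
The "good" layout in the diagram above, where each thread owns one contiguous band of rows, can be sketched with plain std::thread. This is a minimal illustration under assumptions, not what PPL or TBB do internally: ForEachRowInBands is a hypothetical helper, and rowFn stands in for the body of the per-row loop.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: calls rowFn(row) for every row in [0, height), giving
// each thread one contiguous band of rows so its accesses stay local.
template <typename RowFn>
void ForEachRowInBands(int height, int numThreads, RowFn rowFn)
{
    assert(height >= 0 && numThreads > 0);
    const int band = (height + numThreads - 1) / numThreads;  // rows per thread, rounded up
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
    {
        const int first = std::min(height, t * band);
        const int last = std::min(height, first + band);
        if (first == last) break;  // no rows left for this thread
        workers.emplace_back([first, last, &rowFn] {
            for (int row = first; row < last; ++row) rowFn(row);
        });
    }
    for (auto& w : workers) w.join();  // rowFn outlives all workers
}
```

With four threads and the image from the question, each thread would process a band of 635 consecutive rows and synchronize only once, at the join.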

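Code optimizations

The optimizations shown below may already be done by the compiler and may not give better results. It is still worth knowing that they can be done.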

    // is the memset really necessary?
    //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
    parallel_for(0, Height, [&] (int row)
    {
        int rowMod = (row & 1) << 1;
        for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
        {
            auto rgbIndex = tripleBayerIndex + offsets[rowMod];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex];

            //unrolled the loop to save col & 1 operation
            rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
        }
    });

Answer 3

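Here is my suggestion:

- Compute larger chunks in parallel
- Get rid of the modulo/multiplication
- Unroll the inner loop to compute one full pixel (this simplifies the code)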

template<typename T>
void static ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
{
    // convert BGGR->RGB
    // have as many threads as the hardware concurrency is
    parallel_for(0, Height, static_cast<int>(Height/(thread::hardware_concurrency())), [&] (int stride)
    {
        for (auto row = stride; row<2*stride; row++)
        {
            for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
            {
                RgbChannel[rgbCol+0]  = BayerChannel[col+3];
                RgbChannel[rgbCol+1]  = BayerChannel[col+1];
                // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                RgbChannel[rgbCol+2]  = BayerChannel[col+0];
            }
        }
    });
}

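This code is 60% faster than the original version, but still only half as fast as the non-parallelized version on my laptop. As others have already pointed out, this seems to be because the algorithm is memory bound.

EDIT: But I was not happy with that. I could greatly improve the parallel performance by switching from parallel_for to std::async: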

int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i<hc; ++i)
{
    res[i] = async(Converter<char>(bayerChannel, rgbChannel, rows, cols, rows/hc*i, rows/hc*(i+1)));
}
for (int i = 0; i<hc; ++i)
{
    res[i].wait();
}
delete [] res;

Converter is a simple class:

template <class T> class Converter
{
public:
Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) : 
    BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height), startRow(startRow), endRow(endRow)
{
}
void operator()()
{
    // convert BGGR->RGB
    for(int row = startRow; row < endRow; row++)
    {
        for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
        {
            RgbChannel[rgbCol+0]  = BayerChannel[col+3];
            RgbChannel[rgbCol+1]  = BayerChannel[col+1];
            // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
            RgbChannel[rgbCol+2]  = BayerChannel[col+0];
        }
    };
}
private:
T* BayerChannel;
T* RgbChannel;
int Width;
int Height;
int startRow;
int endRow;
};

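This is now 3.5 times faster than the non-parallelized version. From what I have seen in the profiler so far, I assume that the work-stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.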

Answer 4

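I have used neither tbb::parallel_for nor concurrency::parallel_for, but if your numbers are correct, they seem to carry too much overhead. However, I strongly advise you to run more than 10 iterations when testing, and to run a comparable number of warm-up iterations before you start timing.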
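
The measurement advice above (untimed warm-up runs followed by many timed runs) could look roughly like this. MeasureMs and TimingStats are hypothetical names introduced for illustration; the sketch assumes std::chrono::steady_clock is precise enough for the workload being timed.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

struct TimingStats { double meanMs; double stddevMs; };

// Hypothetical helper: runs fn warmupRuns times untimed, then timedRuns times
// timed, and reports the mean and standard deviation in milliseconds.
template <typename Fn>
TimingStats MeasureMs(Fn fn, int warmupRuns, int timedRuns)
{
    assert(warmupRuns >= 0 && timedRuns > 0);
    for (int i = 0; i < warmupRuns; ++i) fn();  // untimed: start thread pools, fault in pages

    std::vector<double> samples;
    samples.reserve(timedRuns);
    for (int i = 0; i < timedRuns; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / samples.size();
    double squares = 0.0;
    for (double s : samples) squares += (s - mean) * (s - mean);
    return { mean, std::sqrt(squares / samples.size()) };
}
```

Passing the serial and the parallel conversion as fn, each with the same run counts, gives figures in the "mean ± deviation" form that can be compared directly.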

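I tested your exact code using three different methods, averaged over 1000 tries.

Serial:      14.6 ± 1.0  ms
std::async:  13.6 ± 1.6  ms
workers:     11.8 ± 1.2  ms

The first is the serial calculation. The second uses four calls to std::async. The last sends four jobs to four background threads that are already started (but sleeping).

The gains are not big, but at least they are gains. I ran the test on a 2012 MacBook Pro with two hyper-threaded cores = 4 logical cores.

For reference, here is my std::async parallel for: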

template<typename Int=int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
    auto N = std::thread::hardware_concurrency();
    std::vector<std::future<void>> futures;

    for (Int ti=0; ti<N; ++ti) {
        // offset by beg so that ranges not starting at 0 are covered correctly
        Int b = beg + ti * (end - beg) / N;
        Int e = beg + (ti+1) * (end - beg) / N;
        if (ti == N-1) { e = end; }

        futures.emplace_back( std::async([&,b,e]() {
            for (Int ix=b; ix<e; ++ix) {
                fun( ix );
            }
        }));
    }

    for (auto&& f : futures) {
        f.wait();
    }
}

Answer 5

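Things to check or do

- Are you using a Core 2 or older processor? They have a very narrow memory bus that is easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (a single memory-bound thread cannot fully saturate it).
- Have you populated all of your memory channels? For example, if you have a dual-channel CPU but only one RAM module installed, or two on the same channel, you are getting half the available bandwidth.
- How are you timing your code?
  - The timing should be done inside the application, as Evgeny Panasyuk suggests.
  - You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code that launches the thread pools, etc.
- Remove the superfluous memset, as others have explained.
- As ogni42 and others have suggested, unroll your inner loop (I did not bother checking the correctness of that solution, but if it is wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it is a good idea anyway.
- Make sure your machine is otherwise idle when doing performance testing.

Additional timings

I have merged the suggestions of Evgeny Panasyuk and ogni42 into a bare-bones C++03 Win32 implementation: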

#include "stdafx.h"

#include <omp.h>
#include <vector>
#include <iostream>
#include <stdio.h>

using namespace std;

const int Width = 3264;
const int Height = 2540*8;

class Timer {
private:
    string name;
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    LARGE_INTEGER frequency;
public:
    Timer(const char *name) : name(name) {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start);
    }

    ~Timer() {
        QueryPerformanceCounter(&stop);
        LARGE_INTEGER time;
        time.QuadPart = stop.QuadPart - start.QuadPart;
        double elapsed = ((double)time.QuadPart /(double)frequency.QuadPart);
        printf("%-20s : %5.2f\n", name.c_str(), elapsed);
    }
};

static const int offsets[] = {2,1,1,0};

template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = 0, bayerIndex = row * Width;
         col < Width; col++, bayerIndex++)
    {
        int offset = (row % 2)*2 + (col % 2); //0...3
        int rgbIndex = bayerIndex * 3 + offsets[offset];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
    }
}

// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = row*Width, rgbCol =row*Width;
         col < row*Width+Width; rgbCol +=3, col+=4)
    {
        RgbChannel[rgbCol+0]  = BayerChannel[col+3];
        RgbChannel[rgbCol+1]  = BayerChannel[col+1];
        // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
        RgbChannel[rgbCol+2]  = BayerChannel[col+0];
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    for(int i = 0; i < 4; ++i)
    {
        {
            Timer t("serial_orig");
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_orig");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_orig");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }

        {
            Timer t("serial_unrolled");
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_unrolled");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_unrolled");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        printf("-----------------------------\n");
    }
    return 0;
}

Here are the timings (in seconds) I see on a triple-channel, 8-way hyper-threaded Core i7-950 box:

serial_orig          :  0.13
omp_dynamic_orig     :  0.10
omp_static_orig      :  0.10
serial_unrolled      :  0.06
omp_dynamic_unrolled :  0.04
omp_static_unrolled  :  0.04

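The "static" versions tell the compiler to divide the work evenly between threads at loop entry. This avoids the overhead of attempting work stealing or other dynamic load balancing. For this code snippet it does not seem to make a difference, even though the workload is very uniform across threads.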

Answer 6

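The performance drop might be happening because you are trying to distribute the for loop over as many cores as there are rows. That many cores are not available, so it ends up behaving like sequential execution again, with the overhead of parallelism on top.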

Answer 7

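I am not very familiar with parallel for loops, but it seems to me that the contention is in the memory access. It appears that your threads overlap in accessing the same pages.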

Can you break up your array accesses into 4k chunks that are aligned with the page boundaries?
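
If you want to try this, one possible approach is to hand each thread a band of rows whose total size is a whole number of pages. The sketch below is only an illustration under assumptions: RowsPerPageAlignedBand is a hypothetical helper, the page size is taken to be 4096 bytes, and the band boundaries only land on page boundaries if the buffer itself starts on one (which std::vector does not guarantee). Whether this actually helps here is untested.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest number of rows whose combined size in bytes
// is a multiple of the page size, i.e. pageBytes / gcd(rowBytes, pageBytes).
inline std::size_t RowsPerPageAlignedBand(std::size_t rowBytes,
                                          std::size_t pageBytes = 4096)
{
    assert(rowBytes > 0 && pageBytes > 0);
    std::size_t a = rowBytes;
    std::size_t b = pageBytes;
    while (b != 0)  // Euclid's algorithm; a ends up as the gcd
    {
        const std::size_t r = a % b;
        a = b;
        b = r;
    }
    return pageBytes / a;
}
```

For a float RGB output row of width 3264 (3264 * 3 * 4 = 39168 bytes) this yields 16 rows, so band sizes would be chosen as multiples of 16 rows.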

Answer 8

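There is no point in talking about parallel performance before the for loop has been optimized as serial code. Here is my attempt at that (some good compilers may be able to produce a similarly optimized version, but I would rather not rely on that):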

    parallel_for(0, Height, [=] (int row) noexcept
    {
        for (auto col=0, bayerindex=row*Width,
                  rgb0=3*bayerindex+offsets[(row%2)*2],
                  rgb1=3*bayerindex+offsets[(row%2)*2+1];
             col < Width; col+=2, bayerindex+=2, rgb0+=6, rgb1+=6 )
        {
            RgbChannel[rgb0] = BayerChannel[bayerindex  ];
            RgbChannel[rgb1] = BayerChannel[bayerindex+1];
        }
    });
