Why is PPL significantly slower than a sequential loop and OpenMP in this case?

Question

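Following on from my question on Code Review, I would like to know why a PPL implementation of a simple transform of two vectors using std::plus<int> is so much slower than the sequential std::transform and than a for loop with OpenMP (sequential (with vectorization): 25ms, sequential (without vectorization): 28ms, C++ AMP: 131ms, PPL: 51ms, OpenMP: 24ms).

I used the following code for profiling, compiled with full optimization in Visual Studio 2013: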

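#include <amp.h>
#include <ppl.h>        // concurrency::parallel_transform
#include <algorithm>    // std::transform
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
#include <assert.h>
#include <functional>
#include <chrono>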

using namespace concurrency;

const std::size_t size = 30737418;

//----------------------------------------------------------------------------
// Program entry point.
//----------------------------------------------------------------------------
int main( )
{
    accelerator default_device;
    std::wcout << "Using device : " << default_device.get_description( ) << std::endl;
    if( default_device == accelerator( accelerator::direct3d_ref ) )
        std::cout << "WARNING!! Running on very slow emulator! Only use this accelerator for debugging." << std::endl;

    std::mt19937 engine;
    std::uniform_int_distribution<int> dist( 0, 10000 );

    std::vector<int> vecTest( size );
    std::vector<int> vecTest2( size );
    std::vector<int> vecResult( size );

    for( int i = 0; i < size; ++i )
    {
        vecTest[i] = dist( engine );
        vecTest2[i] = dist( engine );
    }

    std::vector<int> vecCorrectResult( size );

    std::chrono::high_resolution_clock clock;
    auto beginTime = clock.now();

    std::transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecCorrectResult ), std::plus<int>() );

    auto endTime = clock.now();
    auto timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

#pragma loop(no_vector)
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function (with auto-vectorization disabled) to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

    concurrency::array_view<const int, 1> av1( vecTest );
    concurrency::array_view<const int, 1> av2( vecTest2 );
    concurrency::array_view<int, 1> avResult( vecResult );
    avResult.discard_data();

    concurrency::parallel_for_each( avResult.extent, [=]( concurrency::index<1> index ) restrict(amp) {
        avResult[index] = av1[index] + av2[index];
    } );

    avResult.synchronize();
    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the AMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << std::boolalpha << "The AMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

    concurrency::parallel_transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecResult ), std::plus<int>() );

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the PPL function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The PPL function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

#pragma omp parallel
#pragma omp for
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the OpenMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The OpenMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    return 0;
}

Answer 1

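According to MSDN, the default partitioner for concurrency::parallel_transform is concurrency::auto_partitioner. MSDN says of it:

This method of partitioning employs range stealing for load balancing as well as per-iterate cancellation.

Using this partitioner is overkill for a simple (and memory-bound) operation such as summing two arrays, because its overhead is significant. You should use concurrency::static_partitioner instead. Static partitioning is exactly what most OpenMP implementations use by default when the schedule clause is missing from the for construct.

As already mentioned on Code Review, this is very memory-bound code. It is also the SUM kernel of the STREAM benchmark, which is designed specifically to measure the memory bandwidth of the system it runs on. The a[i] = b[i] + c[i] operation has very low operational intensity (measured in OPS/byte), and its speed is determined solely by the bandwidth of the main memory bus. That is why the OpenMP code and the vectorized serial code deliver essentially the same performance, which is not much higher than that of the non-vectorized serial code.

The way to get higher parallel performance is to run the code on a modern multi-socket system and have the data in each array distributed evenly among the sockets. You can then get a speed-up almost equal to the number of CPU sockets.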
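A minimal sketch of what switching partitioners could look like. The helper name add_vectors is hypothetical and not from the question. Because PPL ships only with Visual C++, the sketch falls back to plain std::transform on other compilers, so only the _MSC_VER branch exercises the partitioner.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

#ifdef _MSC_VER
#include <ppl.h>   // PPL is only available with Visual C++
#endif

// Hypothetical helper (not from the question): element-wise sum of a and b.
std::vector<int> add_vectors(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::vector<int> result(a.size());
#ifdef _MSC_VER
    // Pass static_partitioner explicitly instead of relying on the
    // default auto_partitioner.
    concurrency::parallel_transform(a.begin(), a.end(), b.begin(), result.begin(),
                                    std::plus<int>(),
                                    concurrency::static_partitioner());
#else
    // Sequential fallback so the sketch compiles without PPL.
    std::transform(a.begin(), a.end(), b.begin(), result.begin(), std::plus<int>());
#endif
    return result;
}
```

In the question's benchmark, the equivalent change would be to add concurrency::static_partitioner() as the last argument of the existing parallel_transform call. Whether that closes the gap to OpenMP on a given machine has to be measured.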
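To make the memory-bound argument concrete, the timings from the question can be turned into an effective bandwidth. This is a rough sketch: the function names are hypothetical, it assumes a 4-byte int, and it follows the STREAM convention of counting two loads and one store per element.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by a[i] = b[i] + c[i] over n ints, counted STREAM-style:
// two loads and one store per element.
constexpr std::size_t bytes_moved(std::size_t n)
{
    return 3 * n * sizeof(int);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a run that took
// the given number of milliseconds.
double effective_bandwidth_gbs(std::size_t n, double milliseconds)
{
    assert(milliseconds > 0.0);
    return static_cast<double>(bytes_moved(n)) / (milliseconds * 1e-3) / 1e9;
}
```

With the question's size of 30737418 elements, each pass moves about 369 MB. That works out to roughly 14.8 GB/s for the 25ms sequential run and about 7.2 GB/s for the 51ms PPL run.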
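One common way to spread the arrays over the sockets is to initialize them in parallel with the same static schedule that the compute loop uses. This is only a sketch under assumptions: it relies on the operating system placing each page on the NUMA node of the thread that first writes to it, and on threads being pinned to cores, neither of which is guaranteed. The helper names are hypothetical. A raw array is used because std::vector would initialize every element from the constructing thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: allocate n ints without writing to them, then fill
// them in parallel so each thread is the first to touch its own chunk.
std::unique_ptr<int[]> make_first_touched(std::size_t n, int value)
{
    std::unique_ptr<int[]> data(new int[n]);   // default-initialized, not yet written
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i)
        data[i] = value;
    return data;
}

// Same static schedule as above, so each thread mostly works on the
// elements it touched first.
void add_arrays(const int* b, const int* c, int* a, std::size_t n)
{
    assert(a != nullptr && b != nullptr && c != nullptr);
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i)
        a[i] = b[i] + c[i];
}
```

Without OpenMP enabled, the pragmas are ignored and the code runs sequentially, which still gives correct results but none of the placement effect.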

Solution 1
4 2014-07-07 10:59:15
