tbb::parallel_reduce vs tbb::combinable vs tbb::enumerable_thread_specific
I want to go through an image and process some specific values with regard to the order of the elements. The image has one unsigned char* array containing a mask (255 if the pixel should be processed, else 0) and an unsigned short* array with the pixel values.
I implemented three different methods with tbb, using a single for-loop over the mask array and calculating the x,y coordinates from the loop variable: x = i%width; y = i/width;. If the pixel is visible I want to transform the point using Eigen. The type vector4d is a std::vector<std::array<double,4>> used to store the points.
Here are my three implementations with tbb:

1. tbb::combinable and tbb::parallel_for:
void Combinable(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MyCombinableType.clear();
    MyCombinableType.local().reserve(width * height);
    tbb::parallel_for(tbb::blocked_range<int>(0, width * height),
        [&](const tbb::blocked_range<int>& r)
        {
            vector4d& local = MyCombinableType.local();
            const int end = r.end();
            for (int i = r.begin(); i != end; ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = {double(i % width), double(i / width),
                                   (double)pixel[i], 1};
                    // Map with Eigen and transform
                    local.push_back(arr);
                }
            }
        });
    vector4d idx = MyCombinableType.combine(
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
2. tbb::enumerable_thread_specific and tbb::parallel_for:
void Enumerable(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MyEnumerableType.clear();
    MyEnumerableType.local().reserve(width * height);
    tbb::parallel_for(tbb::blocked_range<int>(0, width * height),
        [&](const tbb::blocked_range<int>& r)
        {
            enumerableType::reference local = MyEnumerableType.local();
            for (int i = r.begin(); i != r.end(); ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = {double(i % width), double(i / width),
                                   (double)pixel[i], 1};
                    // Map with Eigen and transform
                    local.push_back(arr);
                }
            }
        });
    vector4d idx = MyEnumerableType.combine(
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
3. tbb::parallel_reduce:
void Reduce(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    vector4d idx = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, width * height), vector4d(),
        [&](const tbb::blocked_range<int>& r, vector4d init) -> vector4d
        {
            const int end = r.end();
            init.reserve(r.size());
            for (int i = r.begin(); i != end; ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = {double(i % width), double(i / width),
                                   (double)pixel[i], 1};
                    // Map with Eigen and transform
                    init.push_back(arr);
                }
            }
            return init;
        },
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
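For reference, the serial baseline the three versions are compared against presumably looks something like the following sketch (not the asker's exact code):

```cpp
#include <array>
#include <vector>

using array4d  = std::array<double, 4>;
using vector4d = std::vector<array4d>;

// Straightforward single-threaded extraction of masked points.
vector4d Serial(int width, int height,
                const unsigned char* mask, const unsigned short* pixel) {
    vector4d idx;
    for (int i = 0; i < width * height; ++i) {
        if (mask[i] != 0) {
            array4d arr = {double(i % width), double(i / width),
                           double(pixel[i]), 1.0};
            // Map with Eigen and transform (omitted, as in the question)
            idx.push_back(arr);
        }
    }
    return idx;
}
```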
I compared the runtime of the three versions with a serial implementation. The arrays had 8,400,000 elements and every algorithm was repeated 100 times. The results are:

(benchmark table not reproduced in this copy)

I assume that the combine statement is the bottleneck here. What am I doing wrong? Why is parallel_reduce so much slower? Please help!
There are a few optimizations you can apply here:

- Avoid excessive copying: pass by const vector4d& instead, and use [&] lambdas everywhere.
- Use a temporary vector4d on the stack instead of resizing one of the arguments and using it for the return statement.
- In general, use blocked_range2d instead of calculating x = i%width; y = i/width. This not only optimizes out excessive computation but, much more importantly, it optimizes the cache access pattern, which might improve cache usage (not in this case though).
- You are using the functional form of parallel_reduce; try the more efficient imperative form instead. Unfortunately it cannot be called using lambdas, you must define a Body class:
https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_reduce_func.html
It should minimize the number of vector4d copies that are made during your reduction. The vector4d should be a member of your Body class so that it can be reused and appended to by multiple ranges, rather than constructing and merging a unique vector4d for every subdivided range.

(Note: the splitting constructor should NOT copy the contents of the vector4d member; notice how value is always initialized to 0 in Intel's example above.)