tbb::parallel_reduce vs tbb::combinable vs tbb::enumerable_thread_specific

I want to go through an image and process some specific values while respecting the order of the elements. The image has one unsigned char* array containing a mask (255 if the pixel should be processed, 0 otherwise) and an unsigned short* array with the pixel values.

I implemented three different methods with tbb. Each one uses a single for-loop over the mask array and calculates the x,y-coordinates from the loop variable: x = i%width; y = i/width;. If the pixel is visible, I want to transform the point using Eigen. The vector4d is a std::vector<std::array<double,4>> that stores the points.
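For reference, the loop described above can be sketched as a serial baseline. This is only a minimal sketch: the function name CollectSerial is hypothetical, array4d is assumed to be std::array<double,4>, and the Eigen transform is left out as a comment.

```cpp
#include <array>
#include <cassert>
#include <vector>

using array4d  = std::array<double, 4>;   // assumed definition of array4d
using vector4d = std::vector<array4d>;

// Serial baseline (hypothetical name): visit the mask in index order and
// collect one homogeneous point (x, y, value, 1) per visible pixel.
vector4d CollectSerial(int width, int height,
                       const unsigned char* mask,
                       const unsigned short* pixel)
{
    assert(width > 0 && height >= 0);
    vector4d points;
    const int count = width * height;
    for (int i = 0; i < count; ++i) {
        if (mask[i] != 0) {
            const int x = i % width;   // column index
            const int y = i / width;   // row index
            array4d p = {static_cast<double>(x),
                         static_cast<double>(y),
                         static_cast<double>(pixel[i]),
                         1.0};
            // The Eigen transform from the question would be applied here.
            points.push_back(p);
        }
    }
    return points;
}
```

Because the loop runs in index order, the output is ordered by linear pixel index, which is the ordering the parallel versions have to reproduce.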

Here are my three implementations with tbb:

1. tbb::combinable and tbb::parallel_for:

void Combinable(int width, int height, unsigned char* mask,unsigned short*  pixel){ 
    MyCombinableType.clear();
    MyCombinableType.local().reserve(width*height);
    tbb::parallel_for( tbb::blocked_range<int>(0, width*height),
        [&](const tbb::blocked_range<int> &r) 
    {       
        vector4d& local = MyCombinableType.local(); 
        const size_t end = r.end(); 
        for (int i = r.begin(); i != end; ++i)
        {
            if(mask[i]!=0)
            {                                       
                array4d arr = {i%width,i/width,(double)pixel[i],1}; 
                //Map with Eigen and transform
                local.push_back(arr);           
            }
        }
    });

    vector4d idx = MyCombinableType.combine(
        []( vector4d x, vector4d y) 
    {               
        std::size_t n = x.size();
        x.resize(n + y.size());
        std::move(y.begin(), y.end(), x.begin() + n);
        return x;
    });
}

2. tbb::enumerable_thread_specific and tbb::parallel_for:

void Enumerable(int width, int height, unsigned char* mask,unsigned short*  pixel){
    MyEnumerableType.clear();
    MyEnumerableType.local().reserve(width*height);
    tbb::parallel_for( tbb::blocked_range<int>(0, width*height),
        [&](const tbb::blocked_range<int> &r) 
    {
        enumerableType::reference local = MyEnumerableType.local();
        for (int i = r.begin(); i != r.end(); ++i)
        {
            if(mask[i]!=0)
            {
                array4d arr = {i%width,i/width,(double)pixel[i],1}; 
                //Map with Eigen and transform
                local.push_back(arr);               

            }
        }
    });

    vector4d idx = MyEnumerableType.combine(
        [](vector4d x, vector4d y) 
    {           
        std::size_t n = x.size();
        x.resize(n + y.size());
        std::move(y.begin(), y.end(), x.begin() + n);
        return x;
    });
}

3. tbb::parallel_reduce:

void Reduce(int width, int height, unsigned char* mask,unsigned short*  pixel){
    vector4d idx = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, width*height ),vector4d(),
            [&](const tbb::blocked_range<int>& r, vector4d init)->vector4d 
        {
            const size_t end = r.end(); 
            init.reserve(r.size());
            for( int i=r.begin(); i!=end; ++i )
            {   
                if(mask[i]!=0)
                {               
                    array4d arr = {i%width,i/width,(double)pixel[i],1}; 
                    //Map with Eigen and transform
                    init.push_back(arr);            
                }
            }
            return init;
        },
        []( vector4d x,vector4d y )
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);           
            return x;
        }
    );  
}

I compared the runtime of the three versions with a serial implementation. The arrays had 8400000 elements and every algorithm was repeated 100 times. The results are:

  • Serial: ~170ms
  • Enumerable: ~118ms
  • Combinable: ~116ms
  • Reduce: ~720ms

I assume that the combine statement is the bottleneck here. What am I doing wrong? Why is parallel_reduce so much slower? Please help!

There are a few optimizations you can apply here.

  1. Avoid excessive copying: pass const vector4d& instead of taking vector4d by value, and use [&] lambdas everywhere.
  2. Use a temporary vector4d on the stack for the return statement instead of resizing one of the arguments.
  3. Generally, use blocked_range2d instead of calculating x = i%width; y = i/width. This not only eliminates the extra computations but, much more importantly, it changes the cache access pattern in a way that might improve cache usage (not in this case, though).
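The three points above could look roughly like the sketch below. It is plain C++ without the tbb calls, and the names Concat and CollectBlock are hypothetical, as is the assumption that array4d is std::array<double,4>. Concat has the shape of a combine or reduction function that takes its arguments by const reference; CollectBlock shows the row and column loops that a blocked_range2d body would run, with the bounds coming from r.rows() and r.cols() in real tbb code.

```cpp
#include <array>
#include <cassert>
#include <vector>

using array4d  = std::array<double, 4>;   // assumed definition of array4d
using vector4d = std::vector<array4d>;

// Points 1 and 2: read both inputs through const references and build the
// result in a local vector that is allocated exactly once.
vector4d Concat(const vector4d& x, const vector4d& y)
{
    vector4d result;
    result.reserve(x.size() + y.size());
    result.insert(result.end(), x.begin(), x.end());
    result.insert(result.end(), y.begin(), y.end());
    return result;
}

// Point 3: iterate over a block of rows and columns directly, so x and y
// are loop variables and no % or / is needed per pixel.
void CollectBlock(int width,
                  int rowBegin, int rowEnd,
                  int colBegin, int colEnd,
                  const unsigned char* mask,
                  const unsigned short* pixel,
                  vector4d& out)
{
    assert(0 <= colBegin && colEnd <= width);
    for (int y = rowBegin; y < rowEnd; ++y) {
        const int rowStart = y * width;        // linear index of (0, y)
        for (int x = colBegin; x < colEnd; ++x) {
            const int i = rowStart + x;
            if (mask[i] != 0) {
                array4d p = {static_cast<double>(x),
                             static_cast<double>(y),
                             static_cast<double>(pixel[i]),
                             1.0};
                out.push_back(p);
            }
        }
    }
}
```

One caveat with the 2D approach: if the range is split along columns as well as rows, points are collected block by block rather than in linear index order, which matters if the output order has to match the serial version.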

You are using the functional form of parallel_reduce; try the more efficient imperative form instead. Unfortunately, it cannot be called using lambdas; you must define a Body class:

https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_reduce_func.html

It should minimize the number of vector4d copies that are made during your reduction. The vector4d should be a member of your Body class so that it can be reused and appended to across multiple ranges, rather than constructing and merging a separate vector4d for every subdivided range.

(Note: the splitting constructor should NOT copy the contents of the vector4d member; notice how value is always initialized to 0 in Intel's example above.)
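A Body class along these lines might look like the following sketch. To keep it compilable without tbb, SplitTag and IndexRange are stand-ins I made up for tbb::split and tbb::blocked_range<int>; the class name PointCollector is also hypothetical, array4d is assumed to be std::array<double,4>, and the Eigen transform is again omitted.

```cpp
#include <array>
#include <cassert>
#include <vector>

using array4d  = std::array<double, 4>;   // assumed definition of array4d
using vector4d = std::vector<array4d>;

// Stand-ins so the sketch compiles without tbb. In real code these would be
// tbb::split and tbb::blocked_range<int>.
struct SplitTag {};
struct IndexRange {
    int first, last;
    int begin() const { return first; }
    int end() const { return last; }
};

class PointCollector {
public:
    vector4d points;  // accumulated over every subrange this body processes

    PointCollector(int width, const unsigned char* mask,
                   const unsigned short* pixel)
        : width_(width), mask_(mask), pixel_(pixel)
    {
        assert(width > 0);
    }

    // Splitting constructor: copy the inputs only; `points` starts empty.
    PointCollector(PointCollector& other, SplitTag)
        : width_(other.width_), mask_(other.mask_), pixel_(other.pixel_) {}

    // Append the visible pixels of one subrange. `points` is never reset,
    // because the same body may be applied to several subranges in a row.
    void operator()(const IndexRange& r)
    {
        for (int i = r.begin(); i != r.end(); ++i) {
            if (mask_[i] != 0) {
                array4d p = {static_cast<double>(i % width_),
                             static_cast<double>(i / width_),
                             static_cast<double>(pixel_[i]),
                             1.0};
                points.push_back(p);
            }
        }
    }

    // Append the result of the body that handled the range to the right.
    void join(PointCollector& rhs)
    {
        points.insert(points.end(), rhs.points.begin(), rhs.points.end());
    }

private:
    int width_;
    const unsigned char* mask_;
    const unsigned short* pixel_;
};
```

With tbb, after swapping the stand-ins for the real types, the call would be something like PointCollector body(width, mask, pixel); tbb::parallel_reduce(tbb::blocked_range<int>(0, width*height), body); and the result is then read from body.points. A new vector4d is only created when a range is actually split off to another body, not for every subrange.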
