tbb::parallel_reduce vs tbb::combinable vs tbb::enumerable_thread_specific
I want to go through an image and process some specific values with regard to the order of the elements. The image has one unsigned char* array containing a mask (255 if the pixel should be processed, else 0) and an unsigned short* array with the pixel values.
I implemented three different methods with tbb, using a single for-loop over the mask array and calculating the x,y coordinates from the loop variable: x = i%width; y = i/width;. If the pixel is visible I want to transform the point using Eigen. The type vector4d is a std::vector<std::array<double,4>> used to store the points.
Here are my three implementations with tbb:

1. tbb::combinable and tbb::parallel_for:
void Combinable(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MyCombinableType.clear();
    MyCombinableType.local().reserve(width * height);
    tbb::parallel_for(tbb::blocked_range<int>(0, width * height),
        [&](const tbb::blocked_range<int>& r)
        {
            vector4d& local = MyCombinableType.local();
            const int end = r.end();
            for (int i = r.begin(); i != end; ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = {double(i % width), double(i / width),
                                   (double)pixel[i], 1};
                    // Map with Eigen and transform
                    local.push_back(arr);
                }
            }
        });
    vector4d idx = MyCombinableType.combine(
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
2. tbb::enumerable_thread_specific and tbb::parallel_for:
void Enumerable(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MyEnumerableType.clear();
    MyEnumerableType.local().reserve(width * height);
    tbb::parallel_for(tbb::blocked_range<int>(0, width * height),
        [&](const tbb::blocked_range<int>& r)
        {
            enumerableType::reference local = MyEnumerableType.local();
            for (int i = r.begin(); i != r.end(); ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = {double(i % width), double(i / width),
                                   (double)pixel[i], 1};
                    // Map with Eigen and transform
                    local.push_back(arr);
                }
            }
        });
    vector4d idx = MyEnumerableType.combine(
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
3. tbb::parallel_reduce:
void Reduce(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    vector4d idx = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, width * height), vector4d(),
        [&](const tbb::blocked_range<int>& r, vector4d init) -> vector4d
        {
            const int end = r.end();
            init.reserve(r.size());
            for (int i = r.begin(); i != end; ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = {double(i % width), double(i / width),
                                   (double)pixel[i], 1};
                    // Map with Eigen and transform
                    init.push_back(arr);
                }
            }
            return init;
        },
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
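For reference, the serial baseline the three versions are compared against presumably looks something like the following sketch (not the asker's exact code):

```cpp
#include <array>
#include <vector>

using array4d  = std::array<double, 4>;
using vector4d = std::vector<array4d>;

// Straightforward single-threaded extraction of masked points.
vector4d Serial(int width, int height,
                const unsigned char* mask, const unsigned short* pixel) {
    vector4d idx;
    for (int i = 0; i < width * height; ++i) {
        if (mask[i] != 0) {
            array4d arr = {double(i % width), double(i / width),
                           double(pixel[i]), 1.0};
            // Map with Eigen and transform (omitted, as in the question)
            idx.push_back(arr);
        }
    }
    return idx;
}
```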
I compared the runtime of the three versions with a serial implementation. The arrays had 8,400,000 elements and every algorithm was repeated 100 times. The results are:

(benchmark table not reproduced in this copy)

I assume that the combine statement is the bottleneck here. What am I doing wrong? Why is parallel_reduce so much slower? Please help!
There are a few optimizations you can apply here:

- Avoid excessive copying: pass by const vector4d& instead, and use [&] lambdas everywhere.
- Use a temporary vector4d on the stack instead of resizing one of the arguments and using it for the return statement.
- In general, use blocked_range2d instead of calculating x = i%width; y = i/width. This not only optimizes out excessive computation but, much more importantly, it optimizes the cache access pattern, which might improve cache usage (not in this case though).
- You are using the functional form of parallel_reduce; try the more efficient imperative form instead. Unfortunately it cannot be called using lambdas, you must define a Body class:
https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_reduce_func.html
It should minimize the number of vector4d copies that are made during your reduction. The vector4d should be a member of your Body class so that it can be reused and appended to by multiple ranges, rather than constructing and merging a unique vector4d for every subdivided range.

(Note: the splitting constructor should NOT copy the contents of the vector4d member; notice how value is always initialized to 0 in Intel's example above.)