How to multithread “tail call” recursion using TBB
I am trying to use TBB to multithread an existing recursive algorithm. The single-threaded version uses tail-call recursion; structurally it looks something like this:
void my_func() {
    my_recusive_func (0);
}
bool doSomeWork (int i, int& a, int& b, int& c) {
    // do some work
}
void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        my_recusive_func (a);
        my_recusive_func (b);
        my_recusive_func (c);
    }
}
I am a TBB novice, so my first attempt used the parallel_invoke function:
void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        tbb::parallel_invoke (
            [a]{my_recusive_func (a);},
            [b]{my_recusive_func (b);},
            [c]{my_recusive_func (c);});
    }
}
This works and runs faster than the single-threaded version, but it doesn't seem to scale well with the number of cores. The machine I'm targeting has 16 cores (32 hyper-threads), so scalability is very important for this project, but this version gets only about an 8x speedup at best on that machine, and many cores seem idle while the algorithm is running.
My theory is that TBB waits for the child tasks to complete after the parallel_invoke, so there may be many tasks sitting idle, waiting unnecessarily. Would this explain the idle cores? Is there any way to get the parent task to return without waiting for the children? I was thinking of something like this, but I don't know enough about the scheduler yet to know whether it is OK:
void my_func()
{
    tbb::task_group g;
    my_recusive_func (0, g);
    g.wait();
}
void my_recusive_func (int i, tbb::task_group& g) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        g.run([a,&g]{my_recusive_func(a, g);});
        g.run([b,&g]{my_recusive_func(b, g);});
        my_recusive_func (c, g);
    }
}
My first question: is tbb::task_group::run() thread-safe? I couldn't figure that out from the documentation. Also, is there a better way to go about this? Perhaps I should be using the low-level scheduler calls instead?
(I typed this code without compiling, so please forgive typos.)
I'm fairly sure tbb::task_group::run() is thread-safe. I can't find a mention in the documentation, which is quite surprising.
However:
- In the original implementation of task_group, the run() method is clearly noted to be thread-safe. The current implementation is pretty similar.
- The test code for tbb::task_group (in src/test/test_task_group.cpp) comes with a test designed to check the thread-safety of task_group (it spawns a bunch of threads, each of which calls run() a thousand times or more on the same task_group object).
- The sudoku example code (in examples/task_group/sudoku/sudoku.cpp) that comes with TBB also calls task_group::run from multiple threads in a recursive function, essentially the same way your proposed code does.
- task_group is one of the features shared between TBB and Microsoft's PPL, whose task_group is thread-safe. While the TBB documentation cautions that the behavior can still differ between the TBB and the PPL versions, it would be quite surprising if something as fundamental as thread-safety (and hence the need for external synchronization) were different.
- tbb::structured_task_group (described as "like a task_group, but has only a subset of the functionality") has an explicit restriction that "Methods run, run_and_wait, cancel, and wait should be called only by the thread that created the structured_task_group".

There are really two questions here:
It's generally best to spawn only a small number of tasks from a single task_group. If using recursive parallelism, give each level its own task_group, though the performance will likely not be any better than using parallel_invoke.
The low-level tbb::task interface is the best bet. You can even code the tail recursion in it, using the trick where task::execute returns a pointer to the tail-call task.
But I'm a bit concerned about the idling threads. I'm wondering whether there is enough work to keep the threads busy. Consider doing work-span analysis first. If you are using the Intel compiler (or gcc 4.9), you might try experimenting with a Cilk version first. If that doesn't speed up, then even the low-level tbb::task interface is unlikely to help, and higher-level issues (work and span) need to be examined.
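As a rough illustration of the work-span analysis suggested above, here is a minimal sketch under a made-up cost model: every call costs one unit, and every node above the last level spawns exactly three children, as in the three-way recursion from the question. The names WorkSpan, analyze and parallelism are hypothetical helpers, not part of TBB, and a real analysis would have to use measured costs of doSomeWork.

```cpp
#include <cstdint>

// Hypothetical cost model (an assumption, not measured data):
// each call does one unit of work, and each node at level i < end
// spawns three children at level i + 1.
struct WorkSpan {
    std::uint64_t work;  // T_1: total units a single thread would execute
    std::uint64_t span;  // T_inf: units on the longest dependency chain
};

WorkSpan analyze(int i, int end) {
    WorkSpan r{1, 1};  // this node's own unit of work
    if (i < end) {
        // The three subtrees are identical in this model, so analyze one.
        WorkSpan child = analyze(i + 1, end);
        r.work += 3 * child.work;  // work adds up across all children
        r.span += child.span;      // span follows the longest child (all equal here)
    }
    return r;
}

// Upper bound on useful speedup: work divided by span.
double parallelism(const WorkSpan& ws) {
    return static_cast<double>(ws.work) / static_cast<double>(ws.span);
}
```

Under these assumptions, analyze(0, 10) gives a work of 88573 units and a span of 11, so the available parallelism is roughly 8000, far more than 32 hardware threads. If the real recursion tree is much shallower or doSomeWork is very uneven, the ratio can be far smaller, and no scheduler interface will recover the missing parallelism.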
You could alternatively implement this as follows:
#include <cstdio>
#include "tbb/task.h"  // legacy low-level task API (removed in oneTBB 2021)

constexpr int END = 10;
// Below this depth, spawn tasks; at or beyond it, recurse serially.
constexpr int PARALLEL_LIMIT = END - 4;

static void do_work(int i, int j) {
    printf("%d, %d\n", i, j);
}

static void serial_recusive_func(int i, int j) {
    // DO WORK HERE
    // ...
    do_work(i,j);
    if (i < END) {
        serial_recusive_func(i+1, 0);
        serial_recusive_func(i+1, 1);
        serial_recusive_func(i+1, 2);
    }
}

class RecursiveTask : public tbb::task {
    int i;
    int j;
public:
    RecursiveTask(int i, int j) :
        tbb::task(),
        i(i), j(j)
    {}
    task *execute() override {
        // DO WORK HERE
        // ...
        do_work(i,j);
        if (i >= END) return nullptr;
        if (i < PARALLEL_LIMIT) {
            // The continuation takes over this task's parent, so nothing
            // blocks waiting for the three children.
            auto &c = *new (allocate_continuation()) tbb::empty_task();
            c.set_ref_count(3);
            spawn(*new(c.allocate_child()) RecursiveTask(i+1, 0));
            spawn(*new(c.allocate_child()) RecursiveTask(i+1, 1));
            // Reuse this task object as the third child and return it so
            // the scheduler runs it next (the tail call).
            recycle_as_child_of(c);
            i = i+1; j = 2;
            return this;
        } else {
            serial_recusive_func(i+1, 0);
            serial_recusive_func(i+1, 1);
            serial_recusive_func(i+1, 2);
        }
        return nullptr;
    }
};

static void my_func()
{
    tbb::task::spawn_root_and_wait(
        *new(tbb::task::allocate_root()) RecursiveTask(0, 0));
}

int main() {
    my_func();
}
Your question didn't include much information about the "do work here" part, so my implementation doesn't give do_work much opportunity to return a value or to affect the recursion. If you need that, you should edit your question to mention what sort of effect "do work here" is expected to have on the overall computation.