How to multi-thread "tail-call" recursion using TBB

Question

I am trying to use TBB to multi-thread an existing recursive algorithm. The single-threaded version uses tail-call recursion; structurally it looks something like this:

void my_func() {
    my_recusive_func (0);
}

bool doSomeWork (int i, int& a, int& b, int& c) {
    // do some work
}

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        my_recusive_func (a);
        my_recusive_func (b);
        my_recusive_func (c);
    }
}

I am a TBB novice, so my first attempt used the parallel_invoke function:

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        tbb::parallel_invoke (
                [a]{my_recusive_func (a);},
                [b]{my_recusive_func (b);},
                [c]{my_recusive_func (c);});
    }
}

This works and runs faster than the single-threaded version, but it doesn't seem to scale well with the number of cores. The machine I'm targeting has 16 cores (32 hyper-threads), so scalability is very important for this project, but this version gets about an 8x speedup at best on that machine, and many cores seem idle while the algorithm is running.

My theory is that TBB waits for the child tasks to complete after the parallel_invoke, so there may be many tasks sitting idle, waiting unnecessarily. Would this explain the idle cores? Is there any way to get the parent task to return without waiting for the children? I was thinking of something like this, but I don't know enough about the scheduler yet to know whether it is OK:

void my_func()
{
    tbb::task_group g;
    my_recusive_func (0, g);
    g.wait();
}

void my_recusive_func (int i, tbb::task_group& g) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        g.run([a,&g]{my_recusive_func(a, g);});
        g.run([b,&g]{my_recusive_func(b, g);});
        my_recusive_func (c, g);
    }
}

My first question: is tbb::task_group::run() thread-safe? I couldn't figure that out from the documentation. Also, is there a better way to go about this? Perhaps I should be using the low-level scheduler calls instead?

(I typed this code without compiling, so please forgive typos.)

Answer 1

I'm fairly sure tbb::task_group::run() is thread-safe. I can't find a mention of it in the documentation, which is quite surprising.

However, several things point that way:

This 2008 blog post contains a primitive implementation of task_group, whose run() method is clearly noted to be thread-safe. The current implementation is pretty similar.

The testing code for tbb::task_group (in src/test/test_task_group.cpp) includes a test designed to check the thread-safety of task_group: it spawns a bunch of threads, each of which calls run() a thousand times or more on the same task_group object.

The sudoku example code that comes with TBB (in examples/task_group/sudoku/sudoku.cpp) also calls task_group::run from multiple threads in a recursive function, essentially the same way your proposed code does.

task_group is one of the features shared between TBB and Microsoft's PPL, whose task_group is thread-safe. While the TBB documentation cautions that behavior can still differ between the TBB and PPL versions, it would be quite surprising if something as fundamental as thread-safety (and hence the need for external synchronization) were different.

tbb::structured_task_group (described as "like a task_group, but has only a subset of the functionality") has an explicit restriction that "Methods run, run_and_wait, cancel, and wait should be called only by the thread that created the structured_task_group". No such restriction is stated for task_group.

Answer 2

There are really two questions here:

Is the TBB implementation of task_group::run thread-safe? Yes. (We should document this more clearly.)
Is having many threads invoke run() on the same task_group scalable? No. (I believe the Microsoft documentation mentions this somewhere.) The reason is that the task_group becomes a centralized point of contention. It's just a fetch-and-add in the implementation, but that is still ultimately unscalable, since the affected cache line has to bounce between cores.
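The contention effect can be illustrated without TBB. The sketch below uses a plain std::atomic as a stand-in for the group's internal counter; it is not TBB's implementation, and the function names count_shared and count_local are hypothetical. Both functions produce the same total, but the first makes every thread hit one cache line on every step, while the second touches the shared line once per thread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Every increment is a fetch-and-add on the same atomic, so the cache line
// holding `counter` keeps moving between the cores that run the threads.
long count_shared(int num_threads, long per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, per_thread] {
            for (long k = 0; k < per_thread; ++k)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return counter.load();
}

// Each thread accumulates privately and publishes its subtotal once,
// so the shared cache line is written only num_threads times.
long count_local(int num_threads, long per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, per_thread] {
            long local = 0;
            for (long k = 0; k < per_thread; ++k) ++local;
            counter.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return counter.load();
}
```

Timing the two with a large per_thread value on a many-core machine is one way to see the difference; the actual gap depends on the hardware and on how much real work sits between increments.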

It's generally best to spawn a small number of tasks from a task_group. If using recursive parallelism, give each level its own task_group, though the performance will likely not be any better than using parallel_invoke.

The low-level tbb::task interface is the best bet. You can even code the tail recursion in it, using the trick where task::execute returns a pointer to the tail-call task.

But I'm a bit concerned about the idling threads. I'm wondering whether there is enough work to keep the threads busy. Consider doing a work-span analysis first. If you are using the Intel compiler (or gcc 4.9), you might try experimenting with a Cilk version first. If that doesn't speed things up, then even the low-level tbb::task interface is unlikely to help, and higher-level issues (work and span) need to be examined.
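As a rough illustration of work-span analysis, the sketch below computes work (total cost) and span (longest dependency chain) for an idealized recursion tree. It assumes unit cost per node and a uniform branching factor, which the real algorithm may not have; the names WorkSpan, analyze and max_speedup are hypothetical. The ratio work/span bounds the achievable speedup no matter how many cores are available.

```cpp
#include <algorithm>

struct WorkSpan {
    long work;  // total number of unit-cost nodes
    long span;  // nodes on the longest root-to-leaf chain
};

// Assumes every node at depth < max_depth has `branching` children and
// that each node costs one unit; real costs would have to be measured.
WorkSpan analyze(int depth, int max_depth, int branching) {
    if (depth >= max_depth) return {1, 1};
    WorkSpan child = analyze(depth + 1, max_depth, branching);
    return {1 + branching * child.work, 1 + child.span};
}

// Upper bound on speedup: limited by the core count and by work/span.
double max_speedup(const WorkSpan& ws, int cores) {
    return std::min<double>(cores, static_cast<double>(ws.work) / ws.span);
}
```

For a full ternary tree of depth 10, analyze(0, 10, 3) gives work 88573 and span 11, so 16 cores could in principle be kept busy. If in practice only one branch continues at each level (branching 1), work equals span and no scheduler can deliver a speedup.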

Answer 3

You could alternatively implement this as follows:

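#include <cstdio>
// Low-level task API from classic TBB (removed in oneTBB 2021 and later).
#include <tbb/task.h>

constexpr int END = 10;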
constexpr int PARALLEL_LIMIT = END - 4;
static void do_work(int i, int j) {
    printf("%d, %d\n", i, j);
}

static void serial_recusive_func(int i, int j) {
    // DO WORK HERE
    // ...
    do_work(i,j);
    if (i < END) {
        serial_recusive_func(i+1, 0);
        serial_recusive_func(i+1, 1);
        serial_recusive_func(i+1, 2);
    }
}

class RecursiveTask : public tbb::task {
    int i;
    int j;
public:
    RecursiveTask(int i, int j) :
        tbb::task(),
        i(i), j(j)
    {}
    task *execute() override {
        //DO WORK HERE
        //...
        do_work(i,j);
        if (i >= END) return nullptr;
        if (i < PARALLEL_LIMIT) {
            auto &c = *new (allocate_continuation()) tbb::empty_task();
            c.set_ref_count(3);
            spawn(*new(c.allocate_child()) RecursiveTask(i+1, 0));
            spawn(*new(c.allocate_child()) RecursiveTask(i+1, 1));
            recycle_as_child_of(c);
            i = i+1; j = 2;
            return this;
        } else {
            serial_recusive_func(i+1, 0);
            serial_recusive_func(i+1, 1);
            serial_recusive_func(i+1, 2);
        }
        return nullptr;
    }
};
static void my_func()
{
    tbb::task::spawn_root_and_wait(
        *new(tbb::task::allocate_root()) RecursiveTask(0, 0));
}
int main() {
    my_func();
}

Your question didn't include much information about the "do work here" part, so my implementation doesn't give do_work much opportunity to return a value or to affect the recursion. If you need that, you should edit your question to mention what sort of effect "do work here" is expected to have on the overall computation.


Answer 1
3 2014-05-24 15:10:01

Answer 2
3 accepted 2014-05-27 15:20:33

Answer 3
0 2014-05-24 15:00:48
