
C++: Thread pool slower than single threading?

First of all, I did look at the other topics on this website and found that they don't relate to my problem, as those mostly deal with I/O operations or thread-creation overhead. My problem is that my thread pool / worker-task implementation is (in this case) a lot slower than single threading. I'm really confused by this and not sure whether it's the ThreadPool, the task itself, how I test it, the nature of threads, or something out of my control.

// Sorry for the long code
#include <vector>
#include <queue>

#include <thread>
#include <mutex>
#include <condition_variable>
#include <future>
#include <chrono>

#include "task.hpp"

class ThreadPool
{
public:
    ThreadPool()
    {
        for (unsigned i = 0; i < std::thread::hardware_concurrency() - 1; i++)
            m_workers.emplace_back(this, i);

        m_running = true;
        for (auto&& worker : m_workers)
            worker.start();
    }
    ~ThreadPool()
    {
        m_running = false;
        m_task_signal.notify_all();
        for (auto&& worker : m_workers)
            worker.terminate();
    }

    void add_task(Task* task)
    {
        {
            std::unique_lock<std::mutex> lock(m_in_mutex);
            m_in.push(task);
        }
        m_task_signal.notify_one();
    }
private:
    class Worker
    {
    public:
        Worker(ThreadPool* parent, unsigned id) : m_parent(parent), m_id(id)
        {}
        ~Worker()
        {
            terminate();
        }

        void start()
        {
            m_thread = new std::thread(&Worker::work, this);
        }
        void terminate()
        {
            if (m_thread)
            {
                if (m_thread->joinable())
                {
                    m_thread->join();
                    delete m_thread;
                    m_thread = nullptr;
                    m_parent = nullptr;
                }
            }
        }
    private:
        void work()
        {
            while (m_parent->m_running)
            {               
                std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
                m_parent->m_task_signal.wait(lock, [&]()
                {
                    return !m_parent->m_in.empty() || !m_parent->m_running;
                });

                if (!m_parent->m_running) break;
                Task* task = m_parent->m_in.front();
                m_parent->m_in.pop();
                // Fixed the mutex being locked while the task is executed
                lock.unlock();

                task->execute();            
            }
        }
    private:
        ThreadPool* m_parent = nullptr;
        unsigned m_id = 0;

        std::thread* m_thread = nullptr;
    };
private:
    std::vector<Worker> m_workers;

    std::mutex m_in_mutex;
    std::condition_variable m_task_signal;
    std::queue<Task*> m_in;

    bool m_running = false;
};

class TestTask : public Task
{
public:
    TestTask() {}
    TestTask(unsigned number) : m_number(number) {}

    inline void Set(unsigned number) { m_number = number; }

    void execute() override
    {
        if (m_number <= 3)
        {
            m_is_prime = m_number > 1;
            return;
        }
        else if (m_number % 2 == 0 || m_number % 3 == 0)
        {
            m_is_prime = false;
            return;
        }
        else
        {
            for (unsigned i = 5; i * i <= m_number; i += 6)
            {
                if (m_number % i == 0 || m_number % (i + 2) == 0)
                {
                    m_is_prime = false;
                    return;
                }
            }
            m_is_prime = true;
            return;
        }
    }
public:
    unsigned m_number = 0;
    bool m_is_prime = false;
};

int main()
{
    ThreadPool pool;

    unsigned num_tasks = 1000000;
    std::vector<TestTask> tasks(num_tasks);
    for (auto&& task : tasks)
        task.Set(randint(0, 1000000000));

    auto s = std::chrono::high_resolution_clock::now();
    #if MT
    for (auto&& task : tasks)
        pool.add_task(&task);
    #else
    for (auto&& task : tasks)
        task.execute();
    #endif
    auto e = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count() / 1000000000.0;
}

Benchmarks with VS2013 Profiler:

10,000,000 tasks:
    MT:
        13 seconds of wall clock time
        93.36% is spent in msvcp120.dll
        3.45% is spent in Task::execute() // Not good here
    ST:
        0.5 seconds of wall clock time
        97.31% is spent with Task::execute()

Usual disclaimer in such answers: the only way to tell for sure is to measure it with a profiler tool.

But I will try to explain your results without it. First of all, you have one mutex shared across all your threads, so only one thread at a time can execute a task. That kills any gains you might have: in spite of your threads, your code is perfectly serial. So, at the very least, move task execution out of the mutex. You need to lock the mutex only to take a task off the queue; you don't need to hold it while the task executes.

Next, your tasks are so simple that a single thread will execute them in no time, so you can't measure any gains with such tasks. Create some heavier tasks that could produce more interesting results (tasks closer to the real world, not so contrived).

And the third point: threads are not free (context switching, mutex contention, etc.). To see real gains, as the previous two points say, you need tasks that take more time than the overhead the threads introduce, and the code should be truly parallel instead of waiting on some resource that makes it serial.

UPD: I looked at the wrong part of the code. The task is complex enough, provided you create tasks with sufficiently large numbers.


UPD2: I've played with your code and found a good prime number to show how the MT code can do better. Use the following prime: 1019048297. It provides enough computational complexity to show the difference.

But why doesn't your code produce good results? It is hard to tell without seeing the implementation of randint(), but I take it that it is pretty simple: in half of the cases it returns even numbers, and the other cases don't produce many big primes either. So the tasks are so simple that context switching and the other costs around your particular implementation (and threads in general) consume more time than the computation itself. Using the prime number I gave, the tasks have no choice but to spend time computing; there is no quick exit, since the number is big and actually prime. That's why the big number gives you the answer you seek: better times for the MT code.

You should not hold the mutex while the task is being executed; otherwise other threads will not be able to get a task:

void work() {
    while (m_parent->m_running) {
        std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
        m_parent->m_task_signal.wait(lock, [&]() {
            return !m_parent->m_in.empty() || !m_parent->m_running;
        });
        if (!m_parent->m_running) continue;
        Task* currentTask = m_parent->m_in.front();
        m_parent->m_in.pop();
        lock.unlock(); // <- release the lock so that other threads can get tasks
        currentTask->execute();
    }
}

For MT, how much time is spent in each phase of the "overhead": std::unique_lock, m_task_signal.wait, front, pop, unlock?

Based on your results of only 3% useful work, the above consumes the other 97%. I'd get numbers for each part (e.g. add timestamps between each call).

It seems to me that the code you use to [merely] dequeue the next task pointer is quite heavy. I'd use a much simpler [possibly lockless] queue mechanism. Or, perhaps, use atomics to bump an index into the queue instead of the five-step process above. For example:

void
work()
{
    // NOTE: this is just an example, not necessarily the real function.
    // It assumes m_in is an indexable array of Task* and g_index is a
    // shared std::atomic<int>.
    while (m_parent->m_running) {
        int curindex = g_index.fetch_add(1);
        if (curindex >= max_index)
            break;

        Task *task = m_parent->m_in[curindex];

        task->execute();
    }
}

Also, maybe you should pop [say] ten at a time instead of just one.

You might also be memory-bound and/or "task switch" bound. For threads that access an array, for example, more than four threads usually saturates the memory bus. You could also have heavy contention for the lock, such that threads get starved because one thread is monopolizing it [indirectly, even with the new unlock call].

Interthread locking usually involves a "serialization" operation where other cores must synchronize their out-of-order execution pipelines.

Here's a "lockless" implementation:

void
work()
{
    // assume m_id is 0,1,2,...
    int curindex = m_id;

    while (m_parent->m_running) {
        if (curindex >= max_index)
            break;

        Task *task = m_parent->m_in[curindex];

        task->execute();

        curindex += NUMBER_OF_WORKERS;
    }
}
