How long does thread creation and termination take under Windows?

I've split a complex array processing task into a number of threads to take advantage of multi-core processing and am seeing great benefits. Currently, at the start of the task I create the threads, and then wait for them to terminate as they complete their work. I typically create about four times as many threads as there are cores, since each thread is liable to take a different amount of time, and having extra threads ensures all cores are kept occupied most of the time.

I was wondering whether there would be much of a performance advantage to creating the threads as the program fires up, keeping them idle until required, and using them as I start processing. Put more simply, how long does it take to start and end a new thread, above and beyond the processing within the thread? I'm currently starting the threads using:

CWinThread *pMyThread = AfxBeginThread(CMyThreadFunc,&MyData,THREAD_PRIORITY_NORMAL);

Typically I will be using 32 threads across 8 cores on a 64-bit architecture. The process in question currently takes < 1 second, and is fired up each time the display is refreshed. If starting and ending a thread takes < 1 ms, the return doesn't justify the effort. I'm having some difficulty profiling this.

A related question here helps but is a bit vague for what I'm after. Any feedback appreciated.

I wrote this quite a while ago when I had the same basic question (along with another that will be obvious). I've updated it to show a little more about not only how long it takes to create threads, but how long it takes for the threads to start executing:

#include <windows.h>
#include <iostream>
#include <time.h>
#include <vector>

const int num_threads = 32;

const int switches_per_thread = 100000;

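// Each thread records the moment it starts running, then calls Sleep(0)
// repeatedly; Sleep(0) gives up the rest of the time slice to another
// ready thread, so the loop exercises thread switching.
DWORD __stdcall ThreadProc(void *start) {
    QueryPerformanceCounter((LARGE_INTEGER *) start);
    for (int i=0;i<switches_per_thread; i++)
        Sleep(0);
    return 0;
}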

int main(void) {
    HANDLE threads[num_threads];
    DWORD junk;

    std::vector<LARGE_INTEGER> start_times(num_threads);

    LARGE_INTEGER l;
    QueryPerformanceCounter(&l);

    clock_t create_start = clock();
    for (int i=0;i<num_threads; i++)
        threads[i] = CreateThread(NULL, 
                            0, 
                            ThreadProc, 
                            (void *)&start_times[i], 
                            0, 
                            &junk);
    clock_t create_end = clock();

    clock_t wait_start = clock();
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    clock_t wait_end = clock();

    // clock() ticks in units of 1/CLOCKS_PER_SEC seconds; the creation time
    // is averaged over all the threads created in the loop.
    double create_millis = 1000.0 * (create_end - create_start) / CLOCKS_PER_SEC / num_threads;
    std::cout << "Milliseconds to create thread: " << create_millis << "\n";
    double wait_clocks = (wait_end - wait_start);
    double switches = switches_per_thread*num_threads;
    double us_per_switch = wait_clocks/CLOCKS_PER_SEC*1000000/switches;
    std::cout << "Microseconds per thread switch: " << us_per_switch << "\n";

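    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);

    // Delay from just before the creation loop until each thread began running.
    for (auto s : start_times) 
        std::cout << 1000.0 * (s.QuadPart - l.QuadPart) / f.QuadPart <<" ms\n";

    // Release the thread handles now that every thread has finished.
    for (int i=0;i<num_threads; i++)
        CloseHandle(threads[i]);

    return 0;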
}

Sample results:

Milliseconds to create thread: 0.015625
Microseconds per thread switch: 0.0479687

The first few thread start times look like this:

0.0632517 ms
0.117348 ms
0.143703 ms
0.18282 ms
0.209174 ms
0.232478 ms
0.263826 ms
0.315149 ms
0.324026 ms
0.331516 ms
0.3956 ms
0.408639 ms
0.4214 ms

Note that although these happen to be monotonically increasing, that's not guaranteed (though there is definitely a trend in that general direction).

When I first wrote this, the units I used made more sense: on a 33 MHz 486, those results weren't tiny fractions like this. :-) I suppose someday when I'm feeling ambitious, I should rewrite this to use std::async to create the threads and std::chrono to do the timing, but...
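For comparison, here is a minimal sketch of what a std::chrono-based version of the creation and start-up measurement might look like. It is not the program above: it assumes std::thread instead of CreateThread (and instead of the std::async mentioned above), it leaves out the Sleep(0) switching test, and the names `ThreadTimings` and `time_thread_startup` are made up for illustration. It also assumes `num_threads` is greater than zero.

```cpp
#include <chrono>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical result type: not part of the original program.
struct ThreadTimings {
    double create_ms_per_thread;         // time spent in the creation loop, per thread
    std::vector<double> start_delay_ms;  // delay from loop start until each thread first ran
};

// Creates num_threads threads (assumed > 0); each records when it first runs.
ThreadTimings time_thread_startup(int num_threads) {
    std::vector<Clock::time_point> started(num_threads);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    const auto begin = Clock::now();
    for (int i = 0; i < num_threads; ++i) {
        // Each thread writes only its own slot, so no locking is needed.
        threads.emplace_back([&started, i] { started[i] = Clock::now(); });
    }
    const auto created = Clock::now();

    // join() guarantees the timestamps written by the threads are visible here.
    for (auto& t : threads) t.join();

    using ms = std::chrono::duration<double, std::milli>;
    ThreadTimings result;
    result.create_ms_per_thread = ms(created - begin).count() / num_threads;
    for (const auto& s : started)
        result.start_delay_ms.push_back(ms(s - begin).count());
    return result;
}
```

Calling `time_thread_startup(32)` and printing the two fields gives figures comparable in kind to the sample results above, though the absolute numbers will depend on the machine, the compiler, and the standard library's thread implementation.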

Some advice:

  1. If you have lots of work items to process (or there aren't too many, but you have to repeat the whole process from time to time), make sure you use some kind of thread pooling. This way you won't have to recreate the threads all the time, and your original question won't matter any more: the threads will be created only once. I use the QueueUserWorkItem API directly (since my application doesn't use MFC), and even that is not too painful. But in MFC you may have higher-level facilities to take advantage of thread pooling (http://support.microsoft.com/kb/197728).
  2. Try to select the optimal amount of work for one work item. Of course this depends on the nature of your software: is it supposed to be real-time, or is it number crunching in the background? If it's not real-time, then too small an amount of work per work item can hurt performance by increasing the proportion of overhead spent distributing work across threads.
  3. Since hardware configurations can be very different, if your end users can have various machines you can run some calibration routines asynchronously during software startup, so you can estimate how much time certain operations take. The result of the calibration can be an input for a better work size setting later for the real calculations.

I was curious about the modern Windows scheduler, so I wrote another test app. I made my best attempt at measuring thread stop time by optionally spinning up a watching thread.

// Tested on Windows 10 v1903 with E5-1660 v3 @ 3.00GHz, 8 Core(s), 16 Logical Processor(s)
// Times are (min, average, max) in milliseconds.

threads: 100, iterations: 1, testStop: true
Start(0.1083, 5.3665, 13.7103) - Stop(0.0341, 1.5122, 11.0660)

threads: 32, iterations: 3, testStop: true
Start(0.1349, 1.6423, 3.5561) - Stop(0.0396, 0.2877, 3.5195)
Start(0.1093, 1.4992, 3.3982) - Stop(0.0351, 0.2734, 2.0384)
Start(0.1159, 1.5345, 3.5754) - Stop(0.0378, 0.4938, 3.2216)

threads: 4, iterations: 3, testStop: true
Start(0.2066, 0.3553, 0.4598) - Stop(0.0410, 0.1534, 0.4630)
Start(0.2769, 0.3740, 0.4994) - Stop(0.0414, 0.1028, 0.2581)
Start(0.2342, 0.3602, 0.5650) - Stop(0.0497, 0.2199, 0.3620)

threads: 4, iterations: 3, testStop: false
Start(0.1698, 0.2492, 0.3713)
Start(0.1473, 0.2477, 0.4103)
Start(0.1756, 0.2909, 0.4295)

threads: 1, iterations: 10, testStop: false
Start(0.1910, 0.1910, 0.1910)
Start(0.1685, 0.1685, 0.1685)
Start(0.1564, 0.1564, 0.1564)
Start(0.1504, 0.1504, 0.1504)
Start(0.1389, 0.1389, 0.1389)
Start(0.1234, 0.1234, 0.1234)
Start(0.1550, 0.1550, 0.1550)
Start(0.2800, 0.2800, 0.2800)
Start(0.1587, 0.1587, 0.1587)
Start(0.1877, 0.1877, 0.1877)

Source:

#include <windows.h>
#include <iostream>
#include <vector>
#include <chrono>
#include <iomanip>

using namespace std::chrono;

struct Test
{
    HANDLE Thread = { 0 };
    time_point<steady_clock> Creation;
    time_point<steady_clock> Started;
    time_point<steady_clock> Stopped;
};

// Thread body: record when the thread first runs, then exit immediately.
DWORD __stdcall ThreadProc(void* lpParameter) {
    auto test = (Test*)lpParameter;
    test->Started = steady_clock::now();
    return 0;
}

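// Watcher thread: polls every test thread's handle with a zero timeout and
// stamps the time at which each one is first seen signaled (terminated).
// It busy-waits, so it occupies a core for the duration of the test.
DWORD __stdcall TestThreadsEnded(void* lpParameter) {
    auto& tests = *(std::vector<Test>*)lpParameter;

    std::size_t finished = 0;
    while (finished < tests.size())
    {
        for (auto& test : tests)
        {
            if (test.Thread != NULL && WaitForSingleObject(test.Thread, 0) == WAIT_OBJECT_0)
            {
                test.Stopped = steady_clock::now();
                CloseHandle(test.Thread); // release the handle once termination is observed
                test.Thread = NULL;
                finished++;
            }
        }
    }

    return 0;
}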

duration<double, std::milli> diff(time_point<steady_clock> start, time_point<steady_clock> stop)
{
    return stop - start;
}

struct Stats
{
    double min;
    double average;
    double max;
};

Stats stats(const std::vector<double>& durations)
{
    Stats stats = { 1000, 0, 0 };

    for (auto& duration : durations)
    {
        stats.min = duration < stats.min ? duration : stats.min;
        stats.max = duration > stats.max ? duration : stats.max;
        stats.average += duration;
    }

    stats.average /= durations.size();

    return stats;
}

void TestScheduler(const int threadCount, const int iterations, const bool testStop)
{
    std::cout << "\nthreads: " << threadCount << ", iterations: " << iterations << ", testStop: " << (testStop ? "true" : "false") << "\n";

    for (auto i = 0; i < iterations; i++)
    {
        std::vector<Test> tests(threadCount);
        HANDLE testThreadsEnded = NULL;

        if (testStop)
        {
            testThreadsEnded = CreateThread(NULL, 0, TestThreadsEnded, (void*)& tests, 0, NULL);
        }

        for (auto& test : tests)
        {
            test.Creation = steady_clock::now();
            test.Thread = CreateThread(NULL, 0, ThreadProc, (void*)& test, 0, NULL);
        }

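        if (testStop)
        {
            WaitForSingleObject(testThreadsEnded, INFINITE);
            CloseHandle(testThreadsEnded);
        }
        else
        {
            // WaitForMultipleObjects accepts at most MAXIMUM_WAIT_OBJECTS (64) handles,
            // so this branch only works for threadCount <= 64.
            std::vector<HANDLE> threads;
            for (auto& test : tests) threads.push_back(test.Thread);
            WaitForMultipleObjects((DWORD)threads.size(), threads.data(), TRUE, INFINITE);
            for (auto handle : threads) CloseHandle(handle);
        }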

        std::vector<double> startDurations;
        std::vector<double> stopDurations;
        for (auto& test : tests)
        {
            startDurations.push_back(diff(test.Creation, test.Started).count());
            stopDurations.push_back(diff(test.Started, test.Stopped).count());
        }

        auto startStats = stats(startDurations);
        auto stopStats = stats(stopDurations);

        std::cout << std::fixed << std::setprecision(4);
        std::cout << "Start(" << startStats.min << ", " << startStats.average << ", " << startStats.max << ")";
        if (testStop)
        {
            std::cout << " - ";
            std::cout << "Stop(" << stopStats.min << ", " << stopStats.average << ", " << stopStats.max << ")";
        }
        std::cout << "\n";
    }
}

int main(void)
{
    TestScheduler(100, 1, true);
    TestScheduler(32, 3, true);
    TestScheduler(4, 3, true);
    TestScheduler(4, 3, false);
    TestScheduler(1, 10, false);
    return 0;
}
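A rough portable analogue of the polling idea, for readers without a Windows machine: this sketch assumes std::async with std::launch::async in place of CreateThread, and polls each std::future with a zero timeout much as the watcher polls handles with WaitForSingleObject(handle, 0). The function name `time_task_completion` is made up for illustration. Two differences matter: the polling runs on the calling thread after all tasks are launched instead of on a separate watcher started beforehand, and a future becomes ready when the task returns, which is not necessarily the moment the underlying thread has fully terminated.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

using Clock = std::chrono::steady_clock;

// Returns, per task, the milliseconds between the task body starting and
// the poller first noticing that the task has completed.
std::vector<double> time_task_completion(std::size_t task_count) {
    std::vector<Clock::time_point> started(task_count);
    std::vector<Clock::time_point> stopped(task_count);
    std::vector<std::future<void>> futures;
    futures.reserve(task_count);

    for (std::size_t i = 0; i < task_count; ++i) {
        // Each task writes only its own slot of `started`.
        futures.push_back(std::async(std::launch::async,
                                     [&started, i] { started[i] = Clock::now(); }));
    }

    // Busy-poll every future with a zero timeout until all have been seen ready.
    std::vector<bool> seen(task_count, false);
    std::size_t finished = 0;
    while (finished < task_count) {
        for (std::size_t i = 0; i < task_count; ++i) {
            if (!seen[i] &&
                futures[i].wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
                stopped[i] = Clock::now();
                seen[i] = true;
                ++finished;
            }
        }
    }

    using ms = std::chrono::duration<double, std::milli>;
    std::vector<double> result;
    result.reserve(task_count);
    for (std::size_t i = 0; i < task_count; ++i)
        result.push_back(ms(stopped[i] - started[i]).count());
    return result;
}
```

As with the watcher thread above, the polling loop spins at full speed, so the measured values include however long the poller takes to come back around to a given task.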
