Incorrect measurement of the code execution time inside OpenMP thread

So I need to measure the execution time of some code inside a for loop. Originally, I needed to measure several different activities, so I wrote a timer class to help me with that. After that I tried to speed things up by parallelizing the for loop using OpenMP. The problem is that when running my code in parallel my time measurements become really different - the values increase approximately up to a factor of 10. So, to rule out a flaw inside the timer class, I started to measure the execution time of the whole loop iteration, so structurally my code looks something like this:

#include <chrono>
#include <iostream>

#pragma omp parallel for num_threads(20)
for(size_t j = 0; j < entries.size(); ++j)
{
    auto t1 = std::chrono::steady_clock::now();
    // do stuff
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "Execution time is "
              << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count()
              << std::endl;
}

Here are some examples of the difference between measurements in parallel and in a single thread:

Single-threaded (s)    Multi-threaded (s)
11.363545868021        94.154685442
4.8963048650184        16.618173163
4.939025568            25.4751074
18.447368772           110.709813843

Even though these are only a couple of examples, this behaviour seems to prevail in all loop iterations. I also tried to use Boost's chrono library and thread_clock but got the same result. Do I misunderstand something? What may be the cause of this? Maybe I'm getting the cumulative time of all threads?

Inside the for loop, during each iteration I read a different file. Based on this file I create and solve a multitude of mixed-integer optimisation models. I solve them with a MIP solver, which I set to run in one thread. The solver instance is created anew on each iteration. The only variables shared between iterations are constant strings representing paths to some directories.

My machine has 32 threads (16 cores, 2 threads per core).

Also here are the timings of the whole application in single-threaded mode:

real    23m17.763s
user    21m46.284s
sys     1m28.187s

and in multi-threaded mode:

real    12m47.657s
user    156m20.479s
sys     2m34.311s

A few points here.

What you're measuring corresponds (roughly) to what time reports as the user time--that is, the total CPU time consumed by all threads. But when we look at the real time reported by time, we see that your multithreaded code is running close to twice as fast as the single-threaded code. So it is scaling to some degree--but not very well.
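To see that split directly, here is a minimal sketch (entries and the loop body are placeholders, not the original code) that sums the per-iteration times with a reduction and also takes one wall-clock measurement around the whole loop. The sum behaves roughly like the user time, while the outer measurement behaves like the real time:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> entries(100);          // placeholder for the real work items
    double per_iteration_sum = 0.0;         // sum of per-iteration wall-clock times

    auto wall_start = std::chrono::steady_clock::now();

    #pragma omp parallel for num_threads(20) reduction(+ : per_iteration_sum)
    for (std::size_t j = 0; j < entries.size(); ++j)
    {
        auto t1 = std::chrono::steady_clock::now();
        // do stuff (placeholder for the per-file work)
        auto t2 = std::chrono::steady_clock::now();
        per_iteration_sum +=
            std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count();
    }

    auto wall_end = std::chrono::steady_clock::now();
    double wall_time =
        std::chrono::duration_cast<std::chrono::duration<double>>(wall_end - wall_start).count();

    // The summed per-iteration times grow as more threads fight over shared
    // resources (comparable to "user" time), while wall_time is what actually
    // got faster (comparable to "real" time).
    std::printf("sum of per-iteration times: %f s\n", per_iteration_sum);
    std::printf("wall time of the whole loop: %f s\n", wall_time);
}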

Reading a file in the parallel region may well be part of this. Even at best, the fastest NVMe SSDs can only support reading from a few (e.g., around three or four) threads concurrently before you're using the drive's entire available bandwidth (and if you're doing I/O efficiently, that may well be closer to 2). If you're using an actual spinning hard drive, it's usually pretty easy for a single thread to saturate the drive's bandwidth. A PCIe 5 SSD should keep up with more threads, but I kind of doubt even it has the bandwidth to feed 20 threads.
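If disk contention is suspected, one cheap experiment is to serialize just the reads while leaving the solving parallel. A minimal sketch follows; read_file and solve_models are hypothetical stand-ins for the per-file work described in the question, not functions from the post:

#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: slurp one input file into memory.
std::string read_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Hypothetical helper: build and solve the mixed-integer models (CPU-bound).
void solve_models(const std::string& /*contents*/)
{
}

void process_all(const std::vector<std::string>& paths)
{
    #pragma omp parallel for num_threads(20)
    for (std::size_t j = 0; j < paths.size(); ++j)
    {
        std::string contents;
        // Only one thread touches the drive at a time; the CPU-bound
        // solving below still runs in parallel across threads.
        #pragma omp critical(file_io)
        {
            contents = read_file(paths[j]);
        }
        solve_models(contents);
    }
}

If the per-iteration times drop noticeably with the reads serialized, I/O contention is a likely culprit; if they stay inflated, look elsewhere.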

Depending on what parts of the standard library you're using, it's pretty easy to have some "invisible" shared variables. For one common example, quite a bit of code that uses Monte Carlo methods will frequently have calls to rand(). Even though it looks like a normal function call, rand() will typically end up using a seed variable that's shared between threads, and every call to rand() not only reads but also writes that shared variable--so the calls to rand() all end up serialized.
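As an illustration of that point (not something taken from the question's code), a per-thread generator from <random> removes the hidden shared state that makes rand() serialize:

#include <random>

// Each thread gets its own engine, so there is no hidden shared seed to
// fight over - unlike rand(), whose internal state is shared and updated
// on every call.
double random_unit()
{
    thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen);
}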

You mention your MIP solver running in a single thread, but also say there's a separate instance per thread, leaving it unclear whether the MIP solving code is really one thread shared between the 20 other threads, or whether you have one MIP solver instance running in each of the 20 threads. I'd guess the latter, but if it's really the former, then its being a bottleneck wouldn't seem surprising at all.

Without code to look at, it's impossible to get really specific though.
