
Explicit code parallelism in C++

Out-of-order execution in CPUs means that a CPU can reorder instructions to gain better performance, and it means the CPU has to do some very nifty bookkeeping and such. There are other processor approaches too, such as hyper-threading.

Some fancy compilers understand, to a limited extent, how (un)related instructions are, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating-point and integer instructions is another example of this.

Now I have a highly parallel task. And I typically have an ageing single-core x86 processor without hyper-threading.

Is there a straightforward way to get the body of my 'for' loop for this highly parallel task interleaved so that two (or more) iterations are done together? (This is slightly different from 'loop unwinding' as I understand it.)

My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:

void run(int num) {
  for(int n=0; n<num; n++) {
     vm_t data(n);
     for(int i=0; i<data.len(); i++) {
        data.insn(i).parse();
        data.insn(i).eval();
     }
  }  
}

So the execution trail might look like this:

data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval

Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:

data(1) insn(0) parse
data(2) insn(0) parse  \ processor can do OOO as these two flow in
data(1) insn(0) eval   /
data(2) insn(0) eval   \ OOO opportunity here too
data(1) insn(1) parse  /
data(2) insn(1) parse

I know from profiling (e.g. using Callgrind with --simulate-cache=yes) that parsing is mostly random memory accesses (cache misses), and eval is mostly doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...

Is there some C++ template madness that can generate this kind of explicit parallelism?

Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go as far as assembler! But surely there is some pattern for this kind of thing?
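For concreteness, a hand-interleaved version might look something like the sketch below (the structure is illustrative only; it assumes the iterations are genuinely independent and that two objects' instruction streams can simply be walked in lock-step):

void run_interleaved(int num) {
   // Process two vm_t objects per pass so that, within the CPU's out-of-order
   // window, the parse of one can overlap with the eval of the other.
   int n = 0;
   for (; n + 1 < num; n += 2) {
      vm_t a(n);
      vm_t b(n + 1);
      int len = (a.len() < b.len()) ? a.len() : b.len();
      for (int i = 0; i < len; i++) {
         a.insn(i).parse();
         b.insn(i).parse();
         a.insn(i).eval();
         b.insn(i).eval();
      }
      // Drain whichever object still has instructions left over.
      for (int i = len; i < a.len(); i++) { a.insn(i).parse(); a.insn(i).eval(); }
      for (int i = len; i < b.len(); i++) { b.insn(i).parse(); b.insn(i).eval(); }
   }
   // Handle an odd trailing element, if any.
   for (; n < num; n++) {
      vm_t last(n);
      for (int i = 0; i < last.len(); i++) { last.insn(i).parse(); last.insn(i).eval(); }
   }
}

Which is exactly the kind of thing that is hard to read and easy to get wrong.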

Given optimizing compilers and pipelined processors, I would suggest you just write clear, readable code.

Your best plan may be to look into OpenMP. It basically allows you to insert "pragmas" into your code which tell the compiler how it can split the work between processors.
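For the loop in the question, a minimal sketch might look like this (assuming the outer iterations really are independent of each other, and compiling with OpenMP enabled, e.g. -fopenmp on GCC):

void run(int num) {
   // Each outer iteration builds and works on its own vm_t, so the
   // iterations can be handed out to different threads.
   #pragma omp parallel for
   for (int n = 0; n < num; n++) {
      vm_t data(n);
      for (int i = 0; i < data.len(); i++) {
         data.insn(i).parse();
         data.insn(i).eval();
      }
   }
}

On a single core this won't buy much, but on a multi-core machine the outer loop gets divided between threads with essentially no change to the code.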

Hyperthreading is a much higher-level system than instruction reordering. It makes the processor look like two processors to the operating system, so you'd need to use an actual threading library to take advantage of that. The same thing naturally applies to multicore processors.

If you don't want to use low-level threading libraries and instead want to use a task-based parallel system (and it sounds like that's what you're after) I'd suggest looking at OpenMP or Intel's Threading Building Blocks (TBB).

TBB is a library, so it can be used with any modern C++ compiler. OpenMP is a set of compiler extensions, so you need a compiler that supports it; GCC/G++ does from version 4.2 onwards. Recent versions of the Intel and Microsoft compilers also support it. I don't know about any others, though.

EDIT: One other note. Using a system like TBB or OpenMP will scale the processing as much as possible - that is, if you have 100 objects to work on, they'll get split about 50/50 in a two-core system, 25/25/25/25 in a four-core system, etc.
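By way of illustration, a TBB version of the question's loop could look roughly like this (using a C++11 lambda for brevity; TBB also accepts a hand-written function object):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void run_tbb(int num) {
   // TBB chops the range [0, num) into chunks and schedules the chunks
   // across the available cores; each chunk runs the original loop body.
   tbb::parallel_for(tbb::blocked_range<int>(0, num),
      [](const tbb::blocked_range<int>& r) {
         for (int n = r.begin(); n != r.end(); ++n) {
            vm_t data(n);
            for (int i = 0; i < data.len(); i++) {
               data.insn(i).parse();
               data.insn(i).eval();
            }
         }
      });
}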

Modern processors like the Core 2 have an enormous instruction reorder buffer on the order of nearly 100 instructions; even if the compiler is rather dumb, the CPU can still make up for it.

The main issue would be if the code used a lot of registers, in which case the register pressure could force the code to be executed in sequence even if theoretically it could be done in parallel.

There is no support for parallel execution in the current C++ standard. This will change with the next version of the standard, due out next year or so.

However, I don't see what you are trying to accomplish. Are you referring to one single-core processor, or multiple processors or cores? If you have only one core, you should do whatever gets the fewest cache misses, which means whatever approach uses the smallest memory working set. This would probably be either doing all the parsing followed by all the evaluation, or doing the parsing and evaluation alternately.

If you have two cores, and want to use them efficiently, you're going to have to either use a particularly smart compiler or language extensions. Is there one particular operating system you're developing for, or should this be for multiple systems?

It sounds like you ran into the same problem chip designers face: Executing a single instruction takes a lot of effort, but it involves a bunch of different steps that can be strung together in an execution pipeline. (It is easier to execute things in parallel when you can build them out of separate blocks of hardware.)

The most obvious way is to split each task into different threads. You might want to create a single thread to execute each instruction to completion, or create one thread for each of your two execution steps and pass data between them. In either case, you'll have to be very careful with how you share data between threads and make sure to handle the case where one instruction affects the result of the following instruction. Even though you only have one core and only one thread can be running at any given time, your operating system should be able to schedule compute-intensive threads while other threads are waiting for their cache misses.
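As a very rough sketch of the second option (one thread per execution step), something along these lines could work. It assumes it is valid to parse all of an object's instructions before evaluating any of them, and that parse() and eval() only touch their own vm_t; all names are illustrative, and std::thread requires C++11:

#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>

std::queue<vm_t*> parsed;        // objects parsed but not yet evaluated
std::mutex m;
std::condition_variable cv;
bool parsing_done = false;

void parser_thread(int num) {
   for (int n = 0; n < num; n++) {
      vm_t* data = new vm_t(n);
      for (int i = 0; i < data->len(); i++)
         data->insn(i).parse();            // cache-miss-heavy stage
      {
         std::lock_guard<std::mutex> lock(m);
         parsed.push(data);
      }
      cv.notify_one();
   }
   { std::lock_guard<std::mutex> lock(m); parsing_done = true; }
   cv.notify_one();
}

void eval_thread() {
   for (;;) {
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [] { return !parsed.empty() || parsing_done; });
      if (parsed.empty() && parsing_done) break;
      vm_t* data = parsed.front();
      parsed.pop();
      lock.unlock();
      for (int i = 0; i < data->len(); i++)
         data->insn(i).eval();             // register-bound stage
      delete data;
   }
}

void run_pipelined(int num) {
   std::thread p(parser_thread, num);
   std::thread e(eval_thread);
   p.join();
   e.join();
}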

(A few hours of your time would probably pay for a single very fast computer, but if you're trying to deploy it widely on cheap hardware it might make sense to consider the problem the way you're looking at it. Regardless, it's an interesting problem to consider.)

Take a look at Cilk. It's an extension to ANSI C that has some nice constructs for writing parallelized code in C. However, since it's an extension of C, it has very limited compiler support and can be tricky to work with.

This answer was written assuming the question did not contain the part "And I typically have an ageing single-core x86 processor without hyper-threading." I hope it might help other people who want to parallelize highly parallel tasks, but target dual-core/multicore CPUs.

As already posted in another answer, OpenMP is a portable way to do this. However, my experience is that OpenMP's overhead is quite high and it is very easy to beat it with a DIY (Do It Yourself) implementation. Hopefully OpenMP will improve over time, but as it is now, I would not recommend using it for anything other than prototyping.

Given the nature of your task, what you want to do is most likely data-based parallelism, which in my experience is quite easy - the programming style can be very similar to single-core code, because you know what the other threads are doing, which makes maintaining thread safety a lot easier. An approach which worked for me: avoid dependencies and call only thread-safe functions from the loop.

To create a DIY OpenMP-style parallel loop you need to (a minimal sketch follows the list):

  • as preparation, create a serial for-loop template and change your code to use functors to implement the loop bodies. This can be tedious, as you need to pass all references across the functor object
  • create a virtual JobItem interface for the functors, and inherit your functors from this interface
  • create a thread function which is able to process individual JobItem objects
  • create a thread pool using this thread function
  • experiment with various synchronization primitives to see which works best for you. While a semaphore is very easy to use, its overhead is quite significant, and if your loop body is very short you do not want to pay this overhead for each loop iteration. What worked great for me was a combination of a manual-reset event + an atomic (interlocked) counter as a much faster alternative
  • experiment with various JobItem scheduling strategies. If you have a long enough loop, it is better if each thread picks up multiple successive JobItems at a time. This reduces the synchronization overhead and at the same time makes the threads more cache friendly. You may also want to do this in some dynamic way, reducing the length of the scheduled sequence as you are exhausting your tasks, or letting individual threads steal items from other threads' schedules.
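A minimal C++11 sketch of this scheme (using std::thread and a shared atomic counter in place of the manual-reset event + interlocked counter mentioned above; JobItem, VmJob and run_pool are made-up names):

#include <atomic>
#include <thread>
#include <vector>

struct JobItem {                      // virtual interface for one unit of work
   virtual void run() = 0;
   virtual ~JobItem() {}
};

struct VmJob : JobItem {              // functor wrapping one outer-loop iteration
   int n;
   explicit VmJob(int n_) : n(n_) {}
   void run() {
      vm_t data(n);
      for (int i = 0; i < data.len(); i++) {
         data.insn(i).parse();
         data.insn(i).eval();
      }
   }
};

void run_pool(std::vector<JobItem*>& jobs, int num_threads) {
   std::atomic<size_t> next(0);       // index of the next unclaimed job
   std::vector<std::thread> pool;
   for (int t = 0; t < num_threads; t++) {
      pool.push_back(std::thread([&] {
         for (;;) {
            size_t i = next.fetch_add(1);   // claim the next job
            if (i >= jobs.size()) break;
            jobs[i]->run();
         }
      }));
   }
   for (size_t t = 0; t < pool.size(); t++) pool[t].join();
}

Per the last bullet, having each thread claim a small chunk of successive indices instead of a single one reduces contention on the counter and improves cache behaviour.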
