
How does parallel processing solve Von Neumann's bottleneck?

I've been reading about the Von Neumann bottleneck, and as far as I understand, the problem is that the CPU can either fetch an instruction or transfer data, but not both at the same time, since both operations require the same memory bus. So the problem is mainly the limited bus transfer rate. I've read about how to mitigate this problem, and one suggestion is that parallel processing should solve it: the machine no longer depends on one core only, so when one core is stalled on a fetch, the other cores keep working independently, which cuts the computation time drastically.

Is this a correct understanding? If so, don't all of these cores share the same bus to memory, which is what created the bottleneck in the first place?

It doesn't. The Von Neumann bottleneck refers to the fact that the processor and memory sit on opposite sides of a slow bus. If you want to compute something, you have to move the inputs across the bus to the processor, and then store the outputs back to memory when the computation completes. Your throughput is limited by the speed of the memory bus.
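To see why the bus is the limit, here is a minimal sketch (my own illustration, with made-up numbers): each iteration of this streaming kernel moves 24 bytes over the bus but performs only two floating-point operations, so bandwidth, not arithmetic, sets the ceiling.

    #include <stddef.h>

    /* Memory-bound kernel: per iteration, two 8-byte loads and one
     * 8-byte store (24 bytes of bus traffic) but only a multiply and
     * an add (2 flops). */
    void triad(double *a, const double *b, const double *c,
               double s, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

On a hypothetical 20 GB/s bus, this loop tops out around 20e9 / 24 x 2, roughly 1.7 GFLOP/s, no matter how fast the core's arithmetic units are.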

Caches

Caches help to mitigate this problem for many workloads, by keeping a small amount of frequently used data close to the processor. If your workload reuses a lot of data, as many do, then you'll benefit from caching. However, if you are processing a data set that's too big to fit in cache, or if your algorithm doesn't have good data reuse, it may not benefit much from caching. Think of processing a very large data set: you need to load all the data in and store it back out at least once. If you're lucky, your algorithm will only need to see each chunk of data once, and any reused values will stay in cache. If not, you may end up going over the memory bus much more than once per data element.
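How much reuse an algorithm gets is often under your control. Here is a hedged sketch of cache blocking on a matrix multiply (the tile size B is an assumed tuning knob, and c is assumed zero-initialized): each B x B tile crosses the bus once and is then reused many times from cache.

    #include <stddef.h>

    #define N 1024
    #define B 64   /* tile size; pick so a few B x B tiles fit in cache */

    /* Blocked matrix multiply: operate on B x B tiles so data fetched
     * over the bus is reused from cache instead of being evicted. */
    void matmul_tiled(const double a[N][N], const double b[N][N],
                      double c[N][N]) {
        for (size_t ii = 0; ii < N; ii += B)
            for (size_t kk = 0; kk < N; kk += B)
                for (size_t jj = 0; jj < N; jj += B)
                    for (size_t i = ii; i < ii + B; i++)
                        for (size_t k = kk; k < kk + B; k++)
                            for (size_t j = jj; j < jj + B; j++)
                                c[i][j] += a[i][k] * b[k][j];
    }

The naive triple loop refetches elements of b from memory many times over; the blocked version performs the same arithmetic with far less bus traffic.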

Parallel processing

Parallel processing is a pretty broad term. Depending on how you do it, you may or may not get more bandwidth.

Shared Memory

The way shared memory processors are implemented today doesn't do much at all to solve the Von Neumann bottleneck. If anything, having more cores puts more strain on the bus, because now more processors need to fetch data from memory, and you'll need more bandwidth to feed all of them. Case in point: many parallel algorithms are memory-bound, and they can't make use of all the cores on modern multi-core chips, specifically because they can't fetch data fast enough. Core counts keep increasing, and bandwidth per core will likely decrease in the limit, even if total bandwidth grows from one processor generation to the next.
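As a concrete (illustrative, not measured) example, parallelizing the earlier triad kernel with OpenMP adds cores but no bandwidth; on a typical desktop part the speedup flattens after a few threads because the bus is already saturated.

    #include <stddef.h>

    /* Same memory-bound kernel, now threaded. All threads contend for
     * the same memory controllers, so scaling stops once the bus is
     * saturated. Build with an OpenMP-capable compiler, e.g.
     * gcc -O2 -fopenmp. */
    void triad_parallel(double *a, const double *b, const double *c,
                        double s, size_t n) {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

Running this with 1, 2, 4, ... threads (e.g. via OMP_NUM_THREADS) is essentially the STREAM benchmark experiment: throughput climbs until the bus saturates, then plateaus.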

NUMA

Modern memory buses are getting more and more complex, and you can do things to use them more effectively. For example, on NUMA machines, some memory banks are closer to some processors than others, and if you lay out your data carefully, you can get more bandwidth than if you just blindly fetched from anywhere in RAM. But scaling shared memory is difficult -- see Distributed Shared Memory Machines for why it's going to be hard to scale shared memory machines to more than a few thousand cores.
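One common layout trick is "first touch" placement. The sketch below assumes Linux's default first-touch NUMA policy: a page is physically allocated on the node of the thread that first writes it, so initializing an array with the same parallel schedule you'll later compute with keeps each thread reading from its local bank.

    #include <stdlib.h>

    /* NUMA-aware allocation via first touch (assumes Linux's default
     * policy): each thread writes its own chunk first, so those pages
     * land on that thread's local NUMA node. */
    double *alloc_first_touch(size_t n) {
        double *a = malloc(n * sizeof *a);
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;   /* first write decides the page's home node */
        return a;
    }

Libraries such as libnuma (numa_alloc_onnode and friends) give more explicit control, at the cost of portability.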

Distributed Memory

Distributed memory machines are one type of parallel machine. These are often called clusters -- they're basically just a bunch of nodes on the same network trying to do a common task. You can get linear bandwidth scaling across a cluster if each processor fetches only from its local memory, but this requires you to lay out your data carefully so that each processor has its own chunk. People call this data-parallel computing. If your data is mostly data-parallel, you can probably make use of lots of processors, and you can use all of your memory bandwidth in parallel. If you can't parallelize your workload, or if you can't break the data up into chunks so that each is processed mostly by one or a few nodes, then you're back to a sequential workload, still bound by the bandwidth of a single core.
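A hedged MPI sketch of the data-parallel pattern (sizes and data are made up): each rank touches only its own chunk in its own local RAM, so aggregate bandwidth grows with the node count, and only a tiny reduction message crosses the network.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const size_t local_n = 1u << 20;       /* elements per rank */
        double *chunk = malloc(local_n * sizeof *chunk);

        double local_sum = 0.0, total = 0.0;
        for (size_t i = 0; i < local_n; i++) {
            chunk[i] = (double)rank;           /* stand-in for real data */
            local_sum += chunk[i];             /* local memory only */
        }
        /* The only inter-node traffic: 8 bytes per rank. */
        MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f\n", total);

        free(chunk);
        MPI_Finalize();
        return 0;
    }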

Processor in Memory (PIM)

People have looked at alternative node architectures to address the Von Neumann bottleneck. The one most commonly cited is probably Processor-in-Memory, or PIM. In these architectures, to get around memory bus issues, you embed some processors in the memory itself, kind of like a cluster but at a smaller scale. Each tiny core can usually do a few different arithmetic operations on its local data, so some operations run very fast. Again, though, it can be hard to lay out your data in a way that actually makes this type of processing useful, but some algorithms can exploit it.

Summary

In summary, the Von Neumann bottleneck in a general-purpose computer, where the processor can perform any operation on data from any address in memory, comes from the fact that you have to move the data to the processor to compute anything.

Simply building a parallel machine doesn't fix the problem, especially if all your cores are on the same side of the memory bus. If you're willing to have many processors and spread them out so that they are closer to some data than other data, then you can exploit data parallelism to get more bandwidth. Clusters and PIM systems are harder to program than single-core CPUs, though, and not every problem is fundamentally data-parallel. So the Von Neumann bottleneck is likely to be with us for some time.

The Von Neumann bottleneck comes from the shared memory bus for code and data. If you ignore the complex features of today's processors and imagine a simple 8-bit Von Neumann processor with some RAM and some flash, the processor is constantly forced to wait for RAM operations to complete before it can fetch more code from flash. Today the mitigation is mostly done through the processor's L1 and L2 caches and the branch prediction logic embedded in the processor: instructions can be preloaded into the cache, freeing the memory bus for data. Parallelization can help in specific workloads, but the reality is that today's computing paradigm is not much affected by this bottleneck. Processors are very powerful, memories and buses are very fast, and if you need more throughput you can just add more cache to the processor (as Intel does with Xeons, and AMD with Opterons). Parallelization is also more a way of dodging the issue: your parallel workload is still subject to the same rules the processor architecture imposes. If anything, multi-threading should make the problem worse, because multiple workloads compete for the same memory bus. Again, the solution was simply to add more cache between the memory bus and the processor cores.
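The branch prediction point can be seen in a classic micro-benchmark. This is a hedged sketch (the 128 threshold and loop counts are illustrative, and exact timings depend on the CPU): the same loop runs much faster once the data is sorted, because the branch becomes predictable and the pipeline rarely stalls.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    int main(void) {
        enum { N = 1 << 20 };
        int *data = malloc(N * sizeof *data);
        for (int i = 0; i < N; i++)
            data[i] = rand() % 256;

        /* Uncomment to make the branch below predictable: */
        /* qsort(data, N, sizeof *data, cmp); */

        long long sum = 0;
        clock_t t0 = clock();
        for (int pass = 0; pass < 100; pass++)
            for (int i = 0; i < N; i++)
                if (data[i] >= 128)   /* 50/50 and random: mispredicts */
                    sum += data[i];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("sum=%lld time=%.3fs\n", sum, secs);
        free(data);
        return 0;
    }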

As memories are getting faster, and processors not so much anymore, we might yet see this problem become an issue again. But then, word has it that biocomputers are the future of general-purpose computing, so hopefully the next major architecture will take past mistakes into account.
