

Program running slower when doing less calculation in N-body simulation?

I'm writing a simple N-body simulation with particle-particle interaction. I noticed that when I calculate the relative positions among particles, my code actually runs slower when it performs fewer calculations.

At first I tried the straightforward implementation (assume 1D for simplicity):

for(int i=0; i<N; i++)
{
    for(int j=0; j<N; j++)
    {
        r_rel[i][j] = r[i]-r[j];
    }
}

This is just like filling an NxN matrix. These loops compute every r_rel twice since, in fact, r_rel[i][j] = -r_rel[j][i]. Therefore I tried saving some calculations by implementing the following solution:

for(int i=1; i<N; i++)
{
    for(int j=0; j<i; j++)
    {
        r_rel[i][j] = r[i]-r[j];
        r_rel[j][i] = -r_rel[i][j];
    }
}

This way I only calculate the terms below the diagonal of my NxN matrix of relative positions. I expected the code to be faster as it performs fewer calculations, but when I execute it, it runs noticeably slower. How is this even possible? Thanks!!

The first loop traverses r_rel in consecutive memory order, finishing each row before proceeding to the next: it accesses r_rel[i][j] while iterating through each value of j before incrementing i.

The second loop traverses r_rel with two moving points of access, one proceeding in consecutive memory order and the other proceeding through columns of the matrix, jumping across rows. This latter behavior is bad for cache and has poor performance. Traversing row-major arrays along columns is notoriously bad for cache performance.

Cache is expensive high-performance memory that is used to hold copies of recently accessed data, or of data that has been loaded from memory in anticipation of use in the near future. When a program uses memory in a way that often accesses data that is in cache, it may benefit from the high performance of cache memory. When a program often accesses data that is not in cache, the processor must access data in general memory, which is much slower than cache.

Typical features of cache design include:

  • Cache is organized into lines, which are units of contiguous memory. 64 bytes is a typical line size.
  • Cache lines are organized into sets. Each set is associated with certain bits of the memory address. A cache might have, for example, two or eight lines in each set.
  • Within each set, a line may be a copy of any portion of memory whose address has the bits assigned to that set. For example, consider an address with bits aa…aaabbbcccccc. The six c bits tell us which byte this is within a cache line. (2^6 = 64.) The three b bits tell us which cache set this byte must go into. The a bits are recorded with the cache line, remembering where in memory it belongs.

When a process is working through r_rel[i][j] in consecutive memory order, then, each time it accesses a member of r_rel, the one it accesses is either part of the same cache line just accessed in the previous iteration or is in the very next cache line. In the former case, the data is already in cache and is available to the processor quickly. In the latter case, it has to be fetched from memory. (Some processors will have already initiated this fetch, as they prefetch data that is ahead of recent accesses to memory. They are designed to do this because such memory access is a common pattern.)

From the above, we can see that the first set of code has to perform one load per cache line in r_rel. Below, we will compare this number to the corresponding number for the second set of code.

In the second set of code, one of the uses of r_rel proceeds the same way as in the first set of code, although it traverses only half the array. For r_rel[i][j], it performs about half the cache loads of the first code. It performs a few extra loads because of some inefficient use along the diagonal, but we can neglect that.

However, the other use of r_rel, r_rel[j][i], is troublesome. It proceeds down a column, moving through the rows of r_rel.

The question does not give us many details, so I will make up some values for illustration. Suppose the elements of r_rel are four bytes each, and N, the number of elements in a row or column, is a multiple of 128. Also suppose the cache is 4,096 bytes organized into 8 sets of 8 lines of 64 bytes each. With this geometry, the residue (the remainder when divided) of the address modulo 512 determines which cache set the memory must be assigned to.

So, what happens when r_rel[j][i] is accessed is that the 64 bytes of memory around that address are brought into cache and assigned to a particular cache set. Then, when j is incremented, the memory around the new address is brought into cache and assigned to a particular cache set. These are the same cache set. Because the rows are 128 elements, and each element is four bytes, the distance between two elements exactly one row apart is 128•4 = 512 bytes, which is the same as the modulus used to determine which cache set a line goes into. So these two elements get assigned to the same cache set.

That is fine at first. The cache set has eight lines. Unfortunately, the code continues iterating j. Once j has been incremented eight times, it accesses a ninth element of r_rel. Since a cache set has only eight lines, the processor must remove one of the previous lines from the set. As the code continues to iterate j, more lines are removed. Eventually, all the previous lines are removed. When the code finishes its iteration of j and increments i, it returns to near the beginning of the array.

Recall that, in the first set of code, when r_rel[0][2] was accessed, it was still in cache from when r_rel[0][1] had been accessed. However, in the second set of code, r_rel[0][2] is long gone from cache. The processor must load it again.

For the accesses to r_rel[j][i], the second set of code gets effectively no benefit from cache. It has to load from memory for each access. Since, in this example, there are 16 elements in each cache line (four-byte elements, 64-byte lines), it has approximately 16 times as many memory accesses for half the matrix.

In total, if there are x cache lines in the entire array, the first set of code loads x cache lines, and the second set of code loads about x/2 cache lines for the r_rel[i][j] accesses and about x/2•16 = 8•x cache lines for the r_rel[j][i] accesses, a total of 8.5•x cache line loads.

Traversing an array in column order is terrible for cache performance.

The numbers above are examples. The one most easily varied is the array size, N. I assumed it was a multiple of 128. We can consider some other values. If it is a multiple of 64 instead, then r_rel[j][i] and r_rel[j+1][i] will map to different cache sets. However, r_rel[j][i] and r_rel[j+2][i] map to the same set. This means that, after eight iterations of j, only four lines in each set will have been used, so old lines will not yet need to be evicted. Unfortunately, this helps very little because, once i exceeds 16, the code iterates j through enough values that the cache set is again emptied of earlier lines, so each loop on j must load every cache line it encounters.

On the other hand, setting N to a value such as 73 might mitigate some of this effect. Of course, you do not want to change the size of your array just to suit the computer hardware. However, one thing you can do is make the dimensions of the array in memory N by NP even though only N by N elements are used. NP (standing for "N Padded") is chosen to make the rows an odd size relative to the cache geometry. The extra elements are simply wasted.

That provides a quick way to change the program to demonstrate that cache effects are making it slow, but it is usually not a preferred solution. Another approach is to tile access to the array. Instead of iterating i and j through the entire array, the array is partitioned into tiles of some size, A rows by B columns. Two outer loops iterate through all the tiles, and two inner loops iterate through the array elements within each tile.

A and B are chosen so that all of the elements of one tile remain in cache while the inner loops proceed. For the sample numbers above, A and B would have to be eight or less, because only eight rows of the array can be held in one cache set. (There may be other considerations that would make the optimal tile size somewhat smaller. Or, for different element sizes or values of N, the optimal tile size might be larger.)

Note that tiling raises some issues in writing the code. When processing a tile on the diagonal, the code will be handling elements from two points within the same tile. When processing a tile off the diagonal, the code will be handling elements from one point within one tile and a transposed point within another tile. This may affect both the code manipulating the array indices and the bounds of the inner loops. For on-diagonal tiles, the inner loops will look similar to your j < i condition, processing a triangle. For off-diagonal tiles, the inner loops will process a full square (or rectangle if A and B differ).
