Explaining performance difference in two nearly identical algorithms

This question is rather vague and I don't really need an answer to it, but I am very curious about what the answer might be, so I'll ask it anyway.

I have an algorithm which generates a huge number of matrices. It then runs a second algorithm over them which generates a solution. I ran it 100 times and it took an average of ~17 seconds.

The second variant does nearly exactly the same thing; the only difference is that the second algorithm is run over each matrix as soon as it is generated, so the matrices never need to be stored anywhere. This variant obviously needs much less space, which is why I made it, but it also needs an average of only ~2 seconds for the same problem.

I didn't expect it to run faster, and certainly not by that much.

The code is quite big, so I will try to outline the difference in something resembling pseudo-code:

recursiveFill(vector<Matrix> &cache, Matrix permutation) {
  while(!stopCondition) {
    // generate next matrix from current permutation
    if(success)
      cache.push_back(permutation);
    else
      recursiveFill(cache, permutation);
    // some more code
  }
}

recursiveCheck(Matrix permutation) {
  while(!stopCondition) {
    // alter the matrix some
    if(success)
      checkAlgorithm(permutation);
    else
      recursiveCheck(permutation);
    // some more code
  }
}

After the recursive fill, a loop runs checkAlgorithm over all elements in the cache. Everything I didn't include in the code is identical in both algorithms. I guessed that the storing into the vector is what eats up all the time, but if I recall correctly, the size of a C++ vector doubles each time its capacity is exceeded, so a reallocation shouldn't happen too often. Any ideas?

The culprit here is probably temporal locality. Your CPU cache is only so big, so when you save everything off after each run and come back to it later, it has left your CPU caches in the meantime and takes longer (10s to 100s of cycles) to access. With the second method, the data is right there in L1 (or possibly registers) and takes only a cycle or two to access.

In optimization, you generally want to think like the Wu-Tang Clan: Cache Rules Everything Around Me.

Some people have done testing on this, and copies within cache are often much cheaper than dereferences into main memory.

I guess that the additional time is due to the copying of matrices within the vector. With the times you give, one pass through the data takes 20 or 170 ms per run, which is the right order of magnitude for a lot of copying.

Remember that, even though the overhead of copying due to reallocations of the vector is amortized linear, every inserted matrix is copied twice on average: once during insertion and once during a reallocation. In conjunction with the cache-clobbering effect of copying a large amount of data, this can produce the additional runtime.

Now you might say: but I'm also copying the matrices when I pass them to the recursive call, so shouldn't I expect the first algorithm to take at most three times as long as the second?
The answer is that a recursive descent is perfectly cache friendly as long as it is not hampered by cache pressure from data on the heap. Thus, almost all the copying done in the recursive descent does not even reach the L2 cache. If you clobber your entire cache from time to time with a vector reallocation, you will resume with an entirely cold cache afterwards.

Strictly speaking, a vector doesn't have to double on each growth; it just needs to grow geometrically to provide the required amortized constant time.

In this case, if you have a sufficiently large number of matrices, the growth and the required data copies could still be the issue. Or it could be swapping while allocating enough memory. The only way to know for sure is to profile on the system where you observe this difference.

