Explaining performance difference in two nearly identical algorithms

This question is rather vague and I don't really need an answer to it, but I am very curious about what the answer might be, so I'll ask it anyway.

I have an algorithm which generates a huge number of matrices. It then runs a second algorithm over them which generates a solution. I ran it 100 times and it took an average of ~17 seconds.

The second variant does nearly exactly the same thing; the only difference is that the second algorithm is run over each matrix as soon as it is generated, so the matrices never need to be stored anywhere. This variant obviously needs much less space, which is why I made it, but it also needs an average of only ~2 seconds for the same problem.

I didn't expect it to run faster, and certainly not by that much.

The code is quite big, so I will try to outline the difference in something resembling pseudo-code:

recursiveFill(vector<Matrix> &cache, Matrix permutation) {
  while(!stopCondition) {
    // generate next matrix from current permutation
    if(success)
      cache.push_back(permutation);
    else
      recursiveFill(cache, permutation);
    // some more code
  }
}

recursiveCheck(Matrix permutation) {
  while(!stopCondition) {
    // alter the matrix some
    if(success)
      checkAlgorithm(permutation);
    else
      recursiveCheck(permutation);
    // some more code
  }
}

After the recursive fill, a loop runs checkAlgorithm over all elements in the cache. Everything I didn't include in the code is identical in both algorithms. I guessed that the storing into the vector is what eats up all the time, but if I recall correctly, the size of a C++ vector doubles each time its capacity is exceeded, so a reallocation shouldn't happen too often. Any ideas?

The culprit here is probably temporal locality. Your CPU cache is only so big, so when you save everything off after each run and come back to it later, it has left your CPU caches in the meantime and takes longer (10s to 100s of cycles) to access. With the second method, the data is right there in L1 (or possibly registers) and takes only a cycle or two to access.

In optimization, you generally want to think like the Wu-Tang Clan: Cache Rules Everything Around Me.

Some people have done testing on this, and copies within cache are often much cheaper than dereferences into main memory.

I guess that the additional time is due to the copying of matrices within the vector. With the times you give, one pass through the data takes 20 or 170 ms per run, which is the right order of magnitude for a lot of copying.

Remember that, even though the overhead of copying due to reallocations of the vector is amortized linear, every inserted matrix is copied twice on average: once during insertion and once during a reallocation. In conjunction with the cache-clobbering effect of copying a large amount of data, this can produce the additional runtime.

Now you might say: but I'm also copying the matrices when I pass them to the recursive call, so shouldn't I expect the first algorithm to take at most three times as long as the second?
The answer is that a recursive descent is perfectly cache friendly as long as it is not hampered by cache pressure from data on the heap. Thus, almost all the copying done in the recursive descent does not even reach the L2 cache. If you clobber your entire cache from time to time with a vector reallocation, you will resume with an entirely cold cache afterwards.

Strictly speaking, a vector doesn't have to double on each growth; it just needs to grow geometrically to provide the required amortized constant time.

In this case, if you have a sufficiently large number of matrices, the growth and the required data copies could still be the issue. Or it could be swapping while allocating enough memory. The only way to know for sure is to profile on the system where you observe this difference.

