
Strange performance observed with memoized function

I was toying around with something that uses Euclid's algorithm to compute the GCD of two numbers. I implemented the standard one-liner as usual, and it worked fine. It's used in an algorithm that computes a series and calls gcd() several times per element as n gets larger. I decided to see if I could do better by memoizing, so here is what I tried:

#include <cstddef>
#include <functional>
#include <unordered_map>

size_t gcd(size_t const a, size_t const b) {
  return b == 0 ? a : gcd(b, a % b);
}

struct memoized_gcd : private std::unordered_map<unsigned long long, size_t> {
  size_t operator()(size_t const a, size_t const b) {
    // Pack both arguments into one 64-bit key (assuming each fits in 32 bits)
    // so that std::hash does not need to be specialized for a pair type.
    unsigned long long const key = (static_cast<unsigned long long>(a) << 32) | b;
    if (find(key) == end()) (*this)[key] = b == 0 ? a : (*this)(b, a % b);
    return (*this)[key];
  }
};

//std::function<size_t (size_t, size_t)> gcd_impl = gcd<size_t,size_t>;
std::function<size_t (size_t, size_t)> gcd_impl = memoized_gcd();

I call the chosen function through the std::function instance later. Interestingly, when for example n = 10,000, the calculation runs in 8 sec on this computer, while with the memoized version it takes close to a minute, everything else being equal.

Have I missed something obvious? I am using key as an expedient so that I don't need to specialize std::hash for the hash map. The only things I can think of are that the memoized version doesn't get the tail-call optimization (TCO) while gcd() does, or that calling through the std::function is slow for the functor (even though I use it for both). Gurus, show me the way.

Notes

I've tried this on win32 and win64 with g++ 4.7.0, and on linux x86 with g++ 4.6.1 and 4.7.1.

I also tried a version with a std::map<std::pair<size_t, size_t>, size_t>, which had performance comparable to the unmemoized version.

The main issue with your version of GCD is that it may use up huge amounts of memory, depending on the usage pattern.

For example, if you compute GCD(a,b) for all pairs 0 <= a < 10,000, 0 <= b < 10,000, the memoization table will end up with 100,000,000 entries. Since on x86 each entry is 12 bytes, the hash table will take up at least 1.2 GB of memory. Working with that amount of memory is going to be slow.

And of course, if you evaluate GCD with values >= 10,000, you can make the table arbitrarily large... at least until you run out of address space or hit the commit limit.

Summary: In general, memoizing GCD is a bad idea because it leads to unbounded memory usage.

There are some finer points that could be discussed:

  • As the table exceeds various sizes, it will be stored in slower and slower memory: first the L1 cache, then the L2 cache, the L3 cache (if present), physical memory, and finally disk. Obviously the cost of memoization increases dramatically as the table grows.
  • If you know that all inputs are in a small range (e.g., 0 <= x < 100), using memoization or a precomputed table could still be an optimization. Hard to be sure - you'd have to measure in your particular scenario.
  • There are potentially other ways of optimizing GCD. E.g., I am not sure whether g++ automatically recognizes tail recursion in this example. If not, you could get a performance boost by rewriting the recursion into a loop.

But as I said, it is not surprising at all that the algorithm you posted does not perform well.

This isn't that surprising. On modern CPUs, memory access is very slow, especially when it misses the cache. It's often faster to recompute a value than to store it in memory and read it back.

There is frequent heap allocation (when creating new entries). There is also std::unordered_map lookup overhead (while it may be constant time, it is certainly slower than a plain array offset). And there are cache misses (a function of access pattern and size).

If you want to do a "pure" comparison, you can try converting it to use a static, plain array; this might be a sparse lookup table that uses more memory, but it will be more representative of memoization iff you can fit the entire memoized array into your CPU cache.
