C++: comparing a pre-reserved hash map (std::unordered_map) with integer keys and a contiguous data array (std::vector)

Assume a hash map with an int key type:

std::unordered_map<int, data_type> um;

Also, when the total (or maximum) number of elements N is known, the hash table can be sized in advance:

um.reserve(N); // this in turn calls rehash()

As far as I know, an integer can serve as its own hash value, i.e. the hash function can simply be the identity function.

Meanwhile, a contiguous data set (i.e. std::vector or a plain array) can be accessed randomly by an offset from the address of its first element.

Both containers use an int as the access key, like this:

um[1] = data_type(1); //std::unordered_map<int, data_type>
v[1] = data_type(1); //std::vector<data_type>

So, is there any difference between the pre-sized hash table and std::vector, in memory usage, in lookup mechanism/performance, or in anything else?

Let's make the problem concrete.

Suppose I know that the 3 keys 0, 5 and 9987 are certainly used, but keys 1 ~ 9986 may or may not be used.

If I know that no key in the set is bigger than 10000, then a std::vector of size 10000 guarantees O(1) time complexity for random access, but memory would be wasted.

In this situation, does std::unordered_map offer a better solution? I mean a solution that saves as much memory as possible while keeping the time complexity at the same level.

Everything is different.

An unordered_map has the concept of buckets:

A bucket is a slot in the container's internal hash table to which elements are assigned based on the hash value of their key. Buckets are numbered from 0 to (bucket_count-1).

An unordered_map computes the hash value of the key, which selects a bucket. The desired value is in that bucket. Note that multiple keys can map to a single bucket. In your case it may even happen that um[0], um[5] and um[9987] all lie in the same bucket! Searching within a bucket takes linear time.
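To see which bucket each key lands in, the standard bucket interface (bucket(), bucket_count(), bucket_size()) can be queried directly. The following is only a small sketch: the helper name bucket_indices is made up for illustration, and the actual indices are implementation-defined, so no particular values should be expected.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: inserts all keys, then reports the bucket index of each.
// Which bucket a key maps to is implementation-defined.
std::vector<std::size_t> bucket_indices(const std::vector<int>& keys) {
    std::unordered_map<int, int> um;
    for (int k : keys) um[k] = k;  // insert everything first; inserts may rehash

    std::vector<std::size_t> out;
    for (int k : keys) {
        const std::size_t b = um.bucket(k);
        assert(b < um.bucket_count());   // buckets are numbered 0 .. bucket_count()-1
        assert(um.bucket_size(b) >= 1);  // the key itself lives in that bucket
        out.push_back(b);
    }
    return out;
}
```

Calling bucket_indices({0, 5, 9987}) returns three indices; if two of them are equal, those keys share a bucket and a lookup has to walk past one to reach the other.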

In this situation, does std::unordered_map offer a better solution?

If you have sparse data, use an unordered_map, but with an appropriate reserve (or no reserve at all, relying on the default allocation policy). There's no point in calling myMap.reserve(MAX_ELEMENTS), since that would again just waste memory.

Otherwise, use a vector. You get a guaranteed O(1) lookup. Since its storage is contiguous, it is very cache-friendly. With an unordered_map, by contrast, a lookup can be O(N) in the worst case.

Also, when the total (or maximum) number of elements N is known, the hash table can be sized in advance:

um.reserve(N); // this in turn calls rehash()

As far as I know, an integer can serve as its own hash value, i.e. the hash function can simply be the identity function.

That's true, and reasonable in two very different scenarios: 1) when the keys are pretty much contiguous, with perhaps a few missing values, or 2) when the keys are quite random. In many other situations, you risk excessive hash table collisions if you don't provide a meaningful hash function.
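The collision risk can be provoked on purpose. The sketch below assumes the bucket index is derived as hash modulo bucket_count() (as libstdc++ does; the standard does not mandate it) and uses a hypothetical IdentityHash functor so that it does not depend on how std::hash<int> is implemented. Keys that are all multiples of the bucket count then pile up in a single bucket.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical identity hasher: the key is its own hash value.
struct IdentityHash {
    std::size_t operator()(int k) const noexcept {
        return static_cast<std::size_t>(k);
    }
};

// Inserts n keys (n <= 100) that are multiples of the bucket count and
// returns how many elements ended up in the same bucket as key 0.
std::size_t colliding_keys(std::size_t n) {
    std::unordered_map<int, int, IdentityHash> um;
    um.reserve(100);  // no rehash while size() <= 100, so the bucket count stays fixed
    const std::size_t stride = um.bucket_count();
    for (std::size_t i = 0; i < n; ++i)
        um[static_cast<int>(i * stride)] = 0;  // 0, stride, 2*stride, ...
    return um.bucket_size(um.bucket(0));
}
```

Under the modulo assumption every one of those keys hashes to bucket 0, so a lookup degrades to a linear walk over all of them. Keys with such a regular stride are exactly the case where a better-mixing hash function pays off.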

So, is there any difference between the pre-sized hash table and std::vector, in memory usage, in lookup mechanism/performance, or in anything else?

Yes. After your .reserve(N), the hash table allocates a contiguous block of memory (basically, an array) for at least N "buckets". In the GCC implementation, the bucket count is rounded up to a prime. Each bucket may store an iterator into a forward-linked list of pair<int, data_type> nodes.

So, if you actually put N entries into the hash table, you have...

  • an array of >= N elements of sizeof(forward-list-iterator) size
  • N memory allocations of >= sizeof(pair<int, data_type>) + sizeof(next-pointer/iterator for forward-list)

...whilst the vector only uses about N * sizeof(data_type) bytes of memory: potentially a small fraction of the memory used by the hash table. And as all the vector's memory for the data_type elements is contiguous, you're much more likely to benefit from the CPU caching elements adjacent to the one you're currently accessing, so that they're all much faster to access later.

On the other hand, if you haven't put many elements into the hash table, then the main thing using memory is the array of buckets containing iterators, which are usually the size of pointers (e.g. 32 or 64 bits each). The vector of data_type, if you size it to N elements as well, will already have allocated N * sizeof(data_type) bytes of memory; for a large data_type that may be massively more than the hash table. Still, you can often allocate virtual memory freely: as long as you haven't touched the pages, so that they never needed physical backing memory, there's no meaningful memory usage or performance penalty to your program or computer. (At least with 64-bit programs, virtual address space is effectively unlimited.)
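The two cases above can be put into rough numbers. This is a back-of-the-envelope sketch only: the function names are invented for illustration, the sizes are passed in as parameters rather than measured, and allocator overhead and padding are ignored.

```cpp
#include <cstddef>

// Rough footprint of a node-based hash table: a bucket array of pointers
// plus one heap node per entry (the pair and a next pointer).
constexpr std::size_t hash_table_bytes(std::size_t buckets, std::size_t entries,
                                       std::size_t pair_size, std::size_t ptr_size) {
    return buckets * ptr_size + entries * (pair_size + ptr_size);
}

// Footprint of a vector with one slot per possible key.
constexpr std::size_t vector_bytes(std::size_t slots, std::size_t value_size) {
    return slots * value_size;
}
```

Assuming 8-byte pointers and a 4-byte data_type (so an 8-byte pair), a full table with 10000 buckets comes to hash_table_bytes(10000, 10000, 8, 8) = 240000 bytes against vector_bytes(10000, 4) = 40000. With only 3 entries but a 1024-byte data_type (taking 1032 bytes for the pair), the picture flips: 83120 bytes for the hash table against 10240000 for the vector.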

Suppose I know that the 3 keys 0, 5 and 9987 are certainly used, but keys 1 ~ 9986 may or may not be used.

If I know that no key in the set is bigger than 10000, then a std::vector of size 10000 guarantees O(1) time complexity for random access, but memory would be wasted.

In this situation, does std::unordered_map offer a better solution? I mean a solution that saves as much memory as possible while keeping the time complexity at the same level.

In this situation, if you called reserve(10000) up front and data_type was not significantly bigger than an iterator/pointer, then the unordered_map would be unequivocally worse in every regard. If you don't reserve up front, the hash table would only allocate space for a handful of buckets, and you'd be using a lot less virtual address space than a vector with 10000 elements (even if data_type was bool).
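The effect of reserving up front is easy to observe through bucket_count(). A minimal sketch, with a made-up helper name; the exact counts depend on the standard library, but a reserve(n) must leave at least n / max_load_factor() buckets, which is n with the default load factor of 1.0.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical helper: inserts the three known keys, optionally reserving
// first, and returns the resulting number of buckets.
std::size_t buckets_after_insert(std::size_t reserve_hint) {
    std::unordered_map<int, bool> um;
    if (reserve_hint > 0) um.reserve(reserve_hint);  // forces a large bucket array
    um[0] = true;
    um[5] = true;
    um[9987] = true;
    return um.bucket_count();
}
```

buckets_after_insert(10000) returns at least 10000, whereas buckets_after_insert(0) typically returns a small number on the order of ten, since the table only grows as elements arrive.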

If you have only 3 elements to store, the best solution is to use std::vector<std::pair<int, data_type>> :) It takes even less memory than std::unordered_map<int, data_type> (which allocates a bucket array plus a node per element), and its lookup performance is also the best for a small number of elements, thanks to very small constants.
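One way to do lookups in such a vector of pairs is to keep it sorted by key and use std::lower_bound; for three elements a plain linear scan would do just as well. The sketch below uses std::string as a stand-in for data_type, and find_in_sorted is a hypothetical helper name.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;  // std::string stands in for data_type

// Looks up `key` in a vector that is kept sorted by key.
// Returns a pointer to the value, or nullptr if the key is absent.
const std::string* find_in_sorted(const std::vector<Entry>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
                               [](const Entry& e, int k) { return e.first < k; });
    return (it != v.end() && it->first == key) ? &it->second : nullptr;
}
```

With the entries {0, "a"}, {5, "b"}, {9987, "c"}, find_in_sorted(v, 5) points at "b" and find_in_sorted(v, 6) is nullptr. The whole structure is one contiguous allocation of three pairs.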

For larger maps, both std::vector<data_type> and std::unordered_map<int, data_type> offer O(1) lookup (guaranteed for the vector, average-case for the hash map), but the constant hidden in the O is much lower for the vector, since it doesn't need to hash the key or compare it against other elements in the bucket. I would suggest always preferring the vector unless you lack the memory to fit it, in which case you can save memory with the unordered_map by sacrificing a bit of performance.
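When the vector is indexed directly by key, something has to record which slots are actually in use. The following is a rough sketch of one possible approach, not the only one: DenseTable is a hypothetical class, it assumes T is default-constructible and is not bool, and it tracks occupancy in a parallel std::vector<bool>.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical dense table: slot i holds the value for key i.
// Assumes T is default-constructible and is not bool.
template <typename T>
class DenseTable {
public:
    explicit DenseTable(std::size_t max_key)
        : values_(max_key + 1), used_(max_key + 1, false) {}

    void set(std::size_t key, T value) {
        values_.at(key) = std::move(value);  // at() throws std::out_of_range past max_key
        used_[key] = true;
    }

    // O(1) lookup: a bounds check, a flag check and one indexed access.
    const T* find(std::size_t key) const {
        return (key < used_.size() && used_[key]) ? &values_[key] : nullptr;
    }

private:
    std::vector<T> values_;
    std::vector<bool> used_;  // marks which keys actually hold a value
};
```

For the question's example, DenseTable<int> t(10000) followed by t.set(9987, 3) makes t.find(9987) point at 3, while t.find(1) returns nullptr. No hashing and no key comparison is involved, which is where the lower constant comes from.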
