
Difference in performance between map and unordered_map in C++

I have a simple requirement: I need a map of type map<int,int>. However, I need the fastest theoretically possible retrieval time.

I used both map and the newly proposed unordered_map from TR1. I found that, at least while parsing a file and creating the map by inserting one element at a time,

map took only 2 minutes while unordered_map took 5 minutes.

As it is going to be part of code executed on a Hadoop cluster and will contain ~100 million entries, I need the smallest possible retrieval time.

Another helpful piece of information: currently the data (keys) being inserted is the range of integers from 1, 2, ... up to ~10 million.

I can also require the user to specify a max value and to use the order as above. Will that significantly affect my implementation? (I heard map is based on red-black trees, and that inserting in increasing order leads to better (or worse?) performance.)

Here is the code:

#include <fstream>
#include <map>
#include <string>
#include <cstdlib>
using namespace std;

map<int,int> Label;  // this is being changed to unordered_map
fstream LabelFile("Labels.txt");
string inputLine, curnode, nodelabel;
bool failed = false;


// Creating the map from the Label.txt  
if (LabelFile.is_open())  
{  
    while (! LabelFile.eof() )  
    {             
        getline (LabelFile,inputLine);  
        try  
        {  
            curnode=inputLine.substr(0,inputLine.find_first_of("\t"));  
            nodelabel=inputLine.substr(inputLine.find_first_of("\t")+1,inputLine.size()-1);  
            Label[atoi(curnode.c_str())]=atoi(nodelabel.c_str());  
        }  
        catch(char* strerr)  
        {  
            failed=true;  
            break;  
        }  
    }  
    LabelFile.close(); 
}

Tentative solution: After reviewing the comments and answers, I believe a dynamic C++ array would be the best option, since the implementation will use dense keys. Thanks.
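A minimal sketch of that idea, assuming the keys are dense integers in a known range [0, max_key] (the bound and the sentinel value here are illustrative, not taken from the question):

#include <vector>
#include <cstddef>

int main() {
    // Hypothetical upper bound on the key range, e.g. specified by the user.
    const std::size_t max_key = 10000000;

    // One slot per possible key; -1 marks "no label assigned".
    std::vector<int> Label(max_key + 1, -1);

    // Insertion and retrieval are single array accesses:
    // no hashing, no rehashing, no tree rebalancing.
    Label[42] = 7;
    return Label[42] == 7 ? 0 : 1;
}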

Insertion for unordered_map should be O(1) and retrieval should be roughly O(1) (it's essentially a hash table).

Your timings as a result are way off, or there is something wrong with your implementation or usage of unordered_map.

You need to provide some more information, and possibly show how you are using the container.

As per section 6.3 of n1836, the complexities for insertion/retrieval are constant on average and linear in the number of elements in the worst case.

One issue you should consider is that your implementation may need to continually rehash the structure, as you say you have 100 million+ items. In that case, when instantiating the container, if you have a rough idea of how many "unique" elements will be inserted, you can pass that in as a parameter to the constructor, and the container will be instantiated with a bucket table of appropriate size.
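For instance, a sketch of pre-sizing at construction time (using std::unordered_map here; std::tr1::unordered_map offers an equivalent bucket-count constructor, and the count of 10 million is just an illustrative guess):

#include <unordered_map>

int main() {
    // Roughly 10 million unique keys are expected, so request that many buckets
    // up front; with the default max_load_factor of 1.0 the table should not
    // need to rehash while those entries are inserted.
    std::unordered_map<int, int> Label(10000000);

    Label[1] = 42;  // inserts proceed without triggering bucket-table growth
    return 0;
}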

The extra time loading the unordered_map is due to dynamic array resizing. The resizing schedule doubles the number of cells each time the table exceeds its load factor. So, starting from an empty table, expect O(lg n) copies of the entire data table. You can eliminate these extra copies by sizing the hash table upfront. Specifically:

Label.reserve(expected_number_of_entries / Label.max_load_factor());

Dividing by the max_load_factor is to account for the empty cells that are necessary for the hash table to operate.
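As an aside on the API: in C++11 and later, unordered_map::reserve(n) takes the expected number of elements and performs this division internally, whereas rehash(n) takes a bucket count, so the two calls below should have roughly the same effect:

Label.reserve(expected_number_of_entries);                           // element count; division done internally
Label.rehash(expected_number_of_entries / Label.max_load_factor());  // bucket count; divide explicitly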

unordered_map (at least in most implementations) gives fast retrieval, but relatively poor insertion speed compared to map. A tree is generally at its best when the data is randomly ordered, and at its worst when the data is ordered (you constantly insert at one end of the tree, increasing the frequency of re-balancing).
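To see how much insertion order matters in practice, one could time it directly; a minimal sketch (single run, no warm-up, so treat the numbers as rough) comparing ordered and shuffled insertion of the same keys into std::map:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <map>
#include <numeric>
#include <random>
#include <vector>

// Time how long it takes to insert the given keys into an empty std::map.
static double insert_seconds(const std::vector<int>& keys) {
    std::map<int, int> m;
    auto start = std::chrono::steady_clock::now();
    for (int k : keys) m[k] = k;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    std::vector<int> keys(10000000);
    std::iota(keys.begin(), keys.end(), 1);       // 1, 2, ..., ~10 million (ordered)
    double ordered = insert_seconds(keys);

    std::mt19937 rng(12345);
    std::shuffle(keys.begin(), keys.end(), rng);  // same keys, random order
    double shuffled = insert_seconds(keys);

    std::printf("ordered: %.2fs  shuffled: %.2fs\n", ordered, shuffled);
    return 0;
}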

Given that it's ~10 million total entries, you could just allocate a large enough array and get really fast lookups -- assuming you have enough physical memory that it doesn't cause thrashing, but that's not a huge amount of memory by modern standards.
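As a rough sanity check on the memory: with keys running from 1 to ~10 million and int values, a flat array needs about 10,000,000 × 4 bytes ≈ 40 MB, which is indeed small by modern standards (and less than what the per-node overhead of a tree or hash table would add for the same entries).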

Edit: yes, a vector is basically a dynamic array.

Edit 2: The code you've added has some problems. Your while (! LabelFile.eof() ) is broken. You normally want to do something like while (LabelFile >> inputdata) instead. You're also reading the data somewhat inefficiently -- what you're apparently expecting is two numbers separated by a tab. That being the case, I'd write the loop something like:

while (LabelFile >> node >> label)
    Label[node] = label;
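A self-contained sketch of that corrected loading loop, reusing the file and variable names from the question and assuming each line is "node<TAB>label" with both fields numeric:

#include <fstream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> Label;
    Label.rehash(10000000);            // optional: pre-size for ~10 million entries

    std::ifstream LabelFile("Labels.txt");

    int node, label;
    // operator>> skips whitespace (spaces, tabs, newlines), so "node<TAB>label"
    // lines parse directly; the loop ends at EOF or on a malformed line.
    while (LabelFile >> node >> label)
        Label[node] = label;

    return 0;
}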
