
Improving performance of very large dictionary in Python

I find that if I initialize an empty dictionary at the beginning and then add elements to it in a for loop (about 110,000 keys; the value for each key is a list that also grows inside the loop), the speed drops as the loop progresses.
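
Roughly, the code follows this pattern (a simplified sketch; the real keys and values come from my data, and the names here are made up):

 data = {}
 for i in range(110000):
     key = "item-%d" % i                    # ~110,000 distinct keys
     data.setdefault(key, []).append(i)     # each value is a list that keeps growing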

I suspect that the problem is that the dictionary does not know the number of keys at init time and is not doing anything very smart about it, so perhaps hash collisions become frequent and slow it down.

If I know the number of keys and exactly what those keys are, is there any way in Python to make a dict (or a hashtable) work more efficiently? I vaguely remember that if you know the keys, you can design the hash function smartly (perfect hash?) and allocate the space beforehand.

Python doesn't expose a pre-sizing option to speed up the "growth phase" of a dictionary, nor does it provide any direct control over "placement" within the dictionary.

That said, if the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys(). That classmethod is optimized to pre-size the dictionary based on the set size, and it can populate the dictionary without any new calls to __hash__():

>>> keys = {'red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black'}
>>> d = dict.fromkeys(keys)  # dict is pre-sized to 32 empty slots

If reducing collisions is your goal, you can run experiments on the insertion order in the dictionary to minimize pile-ups. (Take a look at Brent's variation on Algorithm D in Knuth's TAOCP to get an idea of how this is done.)

By instrumenting a pure Python model for dictionaries (such as this one), it is possible to count the weighted-average number of probes for an alternative insertion order. For example, inserting dict.fromkeys([11100, 22200, 44400, 33300]) averages 1.75 probes per lookup. That beats the 2.25 average probes per lookup for dict.fromkeys([33300, 22200, 11100, 44400]).
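
As a rough, self-contained sketch of such an experiment (not the linked model itself), the following simulates insertions into an 8-slot table using the classic CPython probe sequence, assuming hash(n) == n for small ints, and reports the mean number of probes per insertion:

 PERTURB_SHIFT = 5

 def average_probes(keys, size=8):
     """Insert keys in order into a simulated open-addressed table
     and return the mean number of probes per insertion."""
     mask = size - 1
     table = [None] * size
     total = 0
     for key in keys:
         perturb = h = hash(key)
         i = h & mask
         probes = 1
         while table[i] is not None:              # slot occupied -> probe again
             i = (i * 5 + perturb + 1) & mask
             perturb >>= PERTURB_SHIFT
             probes += 1
         table[i] = key
         total += probes
     return total / len(keys)

 print(average_probes([11100, 22200, 44400, 33300]))   # 1.75 with this model
 print(average_probes([33300, 22200, 11100, 44400]))   # 2.25 with this model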

Another "trick" is to increase spareness in a fully populated dictionary by fooling it into increasing its size without adding new key s: 另一个“技巧”是通过欺骗它增加其大小而不添加新密钥来增加完全填充的字典中的备用:

 d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
 d.update(dict(d))     # This makes room for additional keys
                       # and makes the set collision-free.
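
Whether this actually triggers a resize depends on the CPython version, so it is worth checking on yours; sys.getsizeof() gives a quick (if rough) indication of whether the table grew:

 import sys
 d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
 before = sys.getsizeof(d)
 d.update(dict(d))                  # attempt to force a resize
 print(before, sys.getsizeof(d))    # compare sizes before and after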

Lastly, you can introduce your own custom __hash__() for your keys with the goal of eliminating all collisions (perhaps using a perfect hash generator such as gperf).
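
As an illustration (a minimal sketch, not a real perfect-hash construction), a key type can carry a precomputed hash, here just a distinct small integer per known key, so that no two keys share a hash value. Note that lookups then have to use the wrapped key type, since a plain str hashes differently:

 KNOWN_KEYS = ['red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black']
 SLOT = {k: i for i, k in enumerate(KNOWN_KEYS)}      # one distinct integer per key

 class PerfectKey(str):
     """A str whose hash is a precomputed, collision-free slot number."""
     def __init__(self, value):
         self._h = SLOT[value]                        # value is the plain string
     def __hash__(self):
         return self._h

 d = dict.fromkeys(PerfectKey(k) for k in KNOWN_KEYS)
 d[PerfectKey('red')] = [1, 2, 3]    # lookups must also use PerfectKey;
                                     # a plain 'red' hashes differently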
