
Fastest way to save and load a large dictionary in Python

I have a relatively large dictionary. How do I know the size? Well, when I save it using cPickle, the file grows to approximately 400 MB. cPickle is supposed to be much faster than pickle, but loading and saving this file still takes a lot of time. I have a dual-core 2.6 GHz laptop with 4 GB RAM running Linux. Does anyone have any suggestions for faster saving and loading of dictionaries in Python? Thanks.

Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
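A minimal sketch of the difference (in Python 3 the cPickle module was merged into pickle, so the example uses pickle; the dictionary is a stand-in for the one in the question):

```python
import pickle  # on Python 2: import cPickle as pickle

# Stand-in for the large dictionary from the question.
data = {i: str(i) for i in range(10000)}

# Protocol 0 is the old ASCII-based default; protocol 2 is binary.
ascii_bytes = pickle.dumps(data, protocol=0)
binary_bytes = pickle.dumps(data, protocol=2)

# The binary protocol is noticeably more compact (and faster to read/write).
assert len(binary_bytes) < len(ascii_bytes)
```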

If you just want to work with a dictionary larger than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on cPickle, so be sure to set your protocol to anything other than 0.
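A minimal sketch of shelve usage (the file path here is illustrative; depending on the underlying dbm backend, shelve.open may create more than one file on disk):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'bigdict')

# Open with a binary pickle protocol, not the default protocol 0.
db = shelve.open(path, protocol=2)
db['answer'] = 42             # keys must be strings; values are pickled
db['items'] = list(range(5))
db.close()

# Reopen later: entries are read from disk on demand.
db = shelve.open(path)
assert db['answer'] == 42
db.close()
```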

The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?

If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.

Sqlite

It might be worthwhile to store the data in a Sqlite database. Although there will be some development overhead when refactoring your program to work with Sqlite, it also becomes much easier and more performant to query the database.

You also get transactions, atomicity, serialization, compression, etc. for free.

Depending on what version of Python you're using, you might already have sqlite built-in.
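A minimal sketch using the built-in sqlite3 module as a key-value store (an in-memory database is used here for illustration; pass a file path to persist to disk):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use a file path for persistence
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT INTO kv VALUES (?, ?)',
                 [('alpha', '1'), ('beta', '2')])
conn.commit()

# Unlike a pickled dict, single entries can be fetched without
# loading the whole data set into memory.
(value,) = conn.execute('SELECT value FROM kv WHERE key = ?',
                        ('beta',)).fetchone()
assert value == '2'
conn.close()
```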

I know it's an old question, but just as an update for those still looking for an answer: the protocol argument has been updated in Python 3, and there are now even faster and more efficient options (i.e. protocol=3 and protocol=4) which might not work under Python 2. You can read more about it in the reference.

In order to always use the best protocol supported by the Python version you're using, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:

import pickle
# ...
with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
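For completeness, a sketch of the matching load step (the protocol is recorded in the file itself, so pickle.load needs no protocol argument; the dictionary and file path here are stand-ins):

```python
import os
import pickle
import tempfile

data = {'a': 1, 'b': 2}  # stand-in for the large dictionary
path = os.path.join(tempfile.mkdtemp(), 'data.pickle')

with open(path, 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

# pickle.load detects the protocol from the file automatically.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

assert loaded == data
```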

You could try compressing your dictionary (there are some limitations, see: this post). If disk access is the bottleneck, it will be efficient.
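A sketch of compressed pickling with the standard-library gzip module (whether this helps depends on how compressible the data is, and on disk I/O rather than CPU being the bottleneck):

```python
import gzip
import os
import pickle
import tempfile

data = {i: 'value %d' % i for i in range(1000)}
path = os.path.join(tempfile.mkdtemp(), 'data.pkl.gz')

# gzip.open returns a file-like object, so pickle writes through it.
with gzip.open(path, 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

with gzip.open(path, 'rb') as f:
    assert pickle.load(f) == data
```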

That is a lot of data... What kind of contents does your dictionary have? If it is only primitive or fixed datatypes, maybe a real database or a custom file format is the better option?
