
Loading a large dictionary using python pickle

I have a full inverted index in the form of a nested python dictionary. Its structure is:

{word : { doc_name : [location_list] } }

For example, let the dictionary be called index; then for a word "spam", the entry would look like:

{ spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } }

I used this structure because python dicts are pretty well optimised, and it makes programming easier.

For any word 'spam', the documents containing it can be given by:

index['spam'].keys()

and the posting list for a document doc1 by:

index['spam']['doc1']

At present I am using cPickle to store and load this dictionary. But the pickled file is around 380 MB and takes a long time to load - 112 seconds (approx.; I timed it using time.time()) - and memory usage goes to 1.2 GB (Gnome system monitor). Once it loads, it's fine. I have 4 GB of RAM.

len(index.keys()) gives 229758

Code

import cPickle as pickle

f = open('full_index','rb')
print 'Loading index... please wait...'
index = pickle.load(f)  # This takes ages
print 'Index loaded. You may now proceed to search'

How can I make it load faster? I only need to load it once, when the application starts. After that, the access time is important for responding to queries.

Should I switch to a database like SQLite and create an index on its keys? If so, how do I store the values with an equivalent schema that makes retrieval easy? Is there anything else I should look into?
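(For what the SQLite route might look like: one possible shape is a single postings table keyed by word, with one row per (word, document) pair and the location list serialized to text. The table and column names below are made up for illustration; this is only a sketch, not a recommendation of the exact schema.)

```python
import sqlite3

# Hypothetical schema: one row per (word, document) posting,
# with the location list serialized as comma-separated text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE postings (
        word      TEXT NOT NULL,
        doc       TEXT NOT NULL,
        locations TEXT NOT NULL
    )
""")
# The index on `word` is what makes lookups fast.
conn.execute("CREATE INDEX idx_word ON postings (word)")

# Insert the example entry for "spam".
rows = [
    ("spam", "doc1.txt", "102,300,399"),
    ("spam", "doc5.txt", "200,587"),
]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)

# Equivalent of index['spam'].keys()
docs = sorted(r[0] for r in conn.execute(
    "SELECT doc FROM postings WHERE word = ?", ("spam",)))

# Equivalent of index['spam']['doc1.txt']
locs = conn.execute(
    "SELECT locations FROM postings WHERE word = ? AND doc = ?",
    ("spam", "doc1.txt")).fetchone()[0]
posting_list = [int(x) for x in locs.split(",")]
```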

Addendum

Using Tim's answer pickle.dump(index, file, -1), the pickled file is considerably smaller - around 237 MB (it took 300 seconds to dump)... and it takes half the time to load now (61 seconds, as opposed to 112 s earlier... timed with time.time()).

But should I migrate to a database for scalability?

For now I am marking Tim's answer as accepted.

PS: I don't want to use Lucene or Xapian... This question refers to Storing an inverted index. I had to ask a new question because I wasn't able to delete the previous one.

Try the protocol argument when using cPickle.dump / cPickle.dumps. From cPickle.Pickler.__doc__:

Pickler(file, protocol=0) -- Create a pickler.

This takes a file-like object for writing a pickle data stream. The optional proto argument tells the pickler to use the given protocol; supported protocols are 0, 1, 2. The default protocol is 0, to be backwards compatible. (Protocol 0 is the only protocol that can be written to a file opened in text mode and read back successfully. When using a protocol higher than 0, make sure the file is opened in binary mode, both when pickling and unpickling.)

Protocol 1 is more efficient than protocol 0; protocol 2 is more efficient than protocol 1.

Specifying a negative protocol version selects the highest protocol version supported. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

The file parameter must have a write() method that accepts a single string argument. It can thus be an open file object, a StringIO object, or any other custom object that meets this interface.
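As a quick sketch of the effect (shown here with Python 3's pickle; module name aside, cPickle behaves the same), a negative protocol typically produces a much smaller byte stream than the protocol-0 default for nested containers like this index:

```python
import pickle  # import cPickle as pickle on Python 2

data = {"spam": {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]}}

# Protocol 0 writes a verbose, text-based format.
p0 = pickle.dumps(data, 0)
# protocol=-1 selects the highest protocol available: compact binary.
p_best = pickle.dumps(data, -1)

# The binary protocol is smaller for this kind of nested structure...
assert len(p_best) < len(p0)
# ...and both round-trip back to an identical dictionary.
assert pickle.loads(p_best) == data
```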

Converting to JSON or YAML will probably take longer than pickling most of the time - pickle stores native Python types.

Do you really need it to load all at once? If you don't need all of it in memory, but only the selected parts you want at any given time, you may want to map your dictionary to a set of files on disk instead of a single file... or map the dict to a database table. So, if you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.

klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary you can avoid some speed penalty by utilizing memory caching.

>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True) 
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo          
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem 
>>> demo.dump()
>>> del demo
>>> 
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>> 

klepto also has other flags such as compression and memmode that can be used to customize how your data is stored (e.g. compression level, memory map mode, etc). It's equally easy (the exact same interface) to use a (MySQL, etc) database as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.

klepto provides access to customizing your encoding, by building a custom keymap.

>>> from klepto.keymaps import *
>>> 
>>> s = stringmap(encoding='hex_codec')
>>> x = [1,2,'3',min]
>>> s(x)
'285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c29'
>>> p = picklemap(serializer='dill')
>>> p(x)
'\x80\x02]q\x00(K\x01K\x02U\x013q\x01c__builtin__\nmin\nq\x02e\x85q\x03.'
>>> sp = s+p
>>> sp(x)
'\x80\x02UT28285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c292c29q\x00.' 

klepto also provides a lot of caching algorithms (like mru, lru, lfu, etc) to help you manage your in-memory cache, and will use the chosen algorithm to do the dump and load to the archive backend for you.

You can use the flag cached=False to turn off memory caching completely and read and write directly to and from disk or database. If your entries are large enough, you might choose to write to disk, putting each entry in its own file. Here's an example that does both.

>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]

However, while this should greatly reduce load time, it might slow overall execution down a bit... it's usually better to specify the maximum amount to hold in the memory cache and pick a good caching algorithm. You have to play with it to get the right balance for your needs.

Get klepto here: https://github.com/uqfoundation

A common pattern in Python 2.x is to have one version of a module implemented in pure Python, with an optional accelerated version implemented as a C extension; for example, pickle and cPickle. This places the burden of importing the accelerated version, and falling back on the pure Python version, on each user of these modules. In Python 3.0, the accelerated versions are considered implementation details of the pure Python versions. Users should always import the standard version, which attempts to import the accelerated version and falls back to the pure Python version. The pickle / cPickle pair received this treatment.

  • Protocol version 0 is the original "human-readable" protocol and is backwards compatible with earlier versions of Python.
  • Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
  • Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
  • Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
  • Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.

If your dictionary is huge and should only be compatible with Python 3.4 or higher, use:

pickle.dump(obj, file, protocol=4)
pickle.load(file, encoding="bytes")

or:

Pickler(file, 4).dump(obj)
Unpickler(file).load()

That said, in 2010 the json module was 25 times faster at encoding and 15 times faster at decoding simple types than pickle. My 2014 benchmark says marshal > pickle > json, but marshal is coupled to specific Python versions.
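It's easy to rerun this kind of comparison on data shaped like your own index; here's a rough timing sketch (absolute numbers, and even the ordering, will vary by machine and Python version, so treat the results as indicative only):

```python
import json
import marshal
import pickle
import timeit

# A small nested dict shaped like the inverted index in the question.
data = {"word%d" % i: {"doc%d.txt" % j: [j, j * 10] for j in range(5)}
        for i in range(1000)}

# Serialize once with each module...
blob_pickle = pickle.dumps(data, -1)
blob_marshal = marshal.dumps(data)
blob_json = json.dumps(data)

# ...then time repeated deserialization, which is the hot path here
# (the index is dumped once but loaded at every application start).
t_pickle = timeit.timeit(lambda: pickle.loads(blob_pickle), number=20)
t_marshal = timeit.timeit(lambda: marshal.loads(blob_marshal), number=20)
t_json = timeit.timeit(lambda: json.loads(blob_json), number=20)

print("pickle: %.4fs  marshal: %.4fs  json: %.4fs"
      % (t_pickle, t_marshal, t_json))
```

All three round-trip this structure losslessly, since the keys are strings and the values are lists of ints; with non-string keys or tuples, json would silently change the types.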

Have you tried using an alternative storage format such as YAML or JSON? Python supports JSON natively from Python 2.6, using the json module I think, and there are third-party modules for YAML.

You may also try the shelve module.
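shelve gives you a persistent, dict-like object in which each entry is pickled individually and only loaded when accessed, so there is no big upfront load at startup. A minimal sketch for the index in the question (the file path is made up):

```python
import os
import shelve
import tempfile

shelf_path = os.path.join(tempfile.mkdtemp(), "full_index_shelf")

# Build the shelf once; each top-level word becomes its own record.
with shelve.open(shelf_path) as db:
    db["spam"] = {"doc1.txt": [102, 300, 399], "doc5.txt": [200, 587]}

# Reopen later (e.g. at application startup) -- no bulk load needed;
# only the entries you touch are unpickled.
with shelve.open(shelf_path) as db:
    docs = sorted(db["spam"].keys())       # like index['spam'].keys()
    posting_list = db["spam"]["doc1.txt"]  # like index['spam']['doc1.txt']
```

Note that only top-level keys are shelved lazily: reading db["spam"] unpickles that word's whole per-document dict, which fits this index's access pattern well.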

Depending on how long 'long' is, you have to think about the trade-offs you have to make: either have all data ready in memory after a (long) startup, or load only partial data (then you need to split up the data into multiple files, or use SQLite or something like that). I doubt that loading all the data upfront from e.g. sqlite into a dictionary will bring any improvement.
