If I am storing a large dictionary as a pickle file, does loading it via cPickle mean that it will all be consumed into memory at once?
If so, is there a cross-platform way to get something like pickle, but access each entry one key at a time (i.e. avoid loading all of the dictionary into memory and only load each entry by name)? I know shelve is supposed to do this: is that as portable as pickle though?
I know shelve is supposed to do this: is that as portable as pickle though?
Yes. shelve is part of The Python Standard Library and is written in Python.
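One caveat worth checking on your own platforms: the shelve module is available everywhere Python is, but the file it writes is produced by whichever dbm backend that Python installation has, so a shelf created on one machine is not guaranteed to open on another. A minimal sketch (Python 3, with a hypothetical temporary path) that reports which backend wrote a shelf:

```python
import dbm
import os
import shelve
import tempfile

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'my.shelve')

# Create a small shelf so there is something to inspect.
with shelve.open(path) as db:
    db['a'] = 1

# dbm.whichdb guesses which dbm module created the file(s) behind the shelf.
backend = dbm.whichdb(path)
print(backend)
```

If the reported backend differs between two machines, copying the shelf file between them may not work, whereas a plain pickle file would.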
So if you have a large dictionary:
bigd = {'a': 1, 'b':2, # . . .
}
If you want to save it without having to read the whole thing back in later, don't save it as a pickle. Save it as a shelf instead, a sort of on-disk dictionary.
import shelve
myShelve = shelve.open('my.shelve')
myShelve.update(bigd)
myShelve.close()
Then later you can:
import shelve
myShelve = shelve.open('my.shelve')
value = myShelve['a']  # only this entry is read from disk
value += 1
myShelve['a'] = value
myShelve.close()  # make sure the change is written to disk
You basically treat the shelve object like a dict, but the items are stored on disk (as individual pickles) and read in as needed.
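For reference, the same round trip can be sketched for Python 3, where a shelf can be used as a context manager so it is closed automatically. The data and the temporary path here are made up for illustration:

```python
import os
import shelve
import tempfile

# Hypothetical stand-in for the large dictionary.
bigd = {'a': 1, 'b': 2, 'c': 3}

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'my.shelve')

# Save every entry as its own pickle inside the shelf.
with shelve.open(path) as db:
    db.update(bigd)

# Later: reopen and touch a single key without loading the rest.
with shelve.open(path) as db:
    value = db['a']
    db['a'] = value + 1
    updated = db['a']
    keys = sorted(db.keys())

print(updated, keys)
```

Note that listing the keys reads only the key index, not the stored values.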
If your objects could be stored as a list of properties, then sqlite may be a good alternative. Shelves and pickles are convenient, but they can only be read by Python, whereas a sqlite database can be read from most languages.
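A rough sketch of the sqlite approach in Python 3, assuming the values can be represented as JSON. The table name kv, the helper load_entry, and the sample data are all hypothetical, and an in-memory database is used only to keep the example self-contained:

```python
import json
import sqlite3

# Hypothetical stand-in for the large dictionary.
bigd = {'a': 1, 'b': 2, 'c': [1, 2, 3]}

# ':memory:' keeps the sketch self-contained; pass a file path to persist.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

# Store each value as JSON text so other languages can read it too.
conn.executemany(
    'INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
    [(k, json.dumps(v)) for k, v in bigd.items()],
)
conn.commit()

def load_entry(conn, key):
    """Fetch and decode a single entry by key, like dict lookup."""
    row = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return json.loads(row[0])

print(load_entry(conn, 'c'))
```

Only the requested row is read and decoded, so memory use stays proportional to one entry rather than the whole dictionary.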
If you want a module that's more robust than shelve, you might look at klepto. klepto is built to provide a dictionary interface to platform-agnostic storage on disk or database, and is built to work with large data.
Here, we first create some pickled objects stored on disk. They use the dir_archive, which stores one object per file.
>>> d = dict(zip('abcde',range(5)))
>>> d['f'] = max
>>> d['g'] = lambda x:x**2
>>>
>>> import klepto
>>> help(klepto.archives.dir_archive)
>>> print klepto.archives.dir_archive.__new__.__doc__
initialize a dictionary with a file-folder archive backend
Inputs:
name: name of the root archive directory [default: memo]
dict: initial dictionary to seed the archive
cached: if True, use an in-memory cache interface to the archive
serialized: if True, pickle file contents; otherwise save python objects
compression: compression level (0 to 9) [default: 0 (no compression)]
memmode: access mode for files, one of {None, 'r+', 'r', 'w+', 'c'}
memsize: approximate size (in MB) of cache for in-memory compression
>>> a = klepto.archives.dir_archive(dict=d)
>>> a
dir_archive('memo', {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': <function <lambda> at 0x102f562a8>, 'f': <built-in function max>}, cached=True)
>>> a.dump()
>>> del a
Now that the data is all on disk, let's pick and choose which entries to load into memory. b is the dict in memory, while b.archive maps the collection of files into a dictionary view.
>>> b = klepto.archives.dir_archive('memo')
>>> b
dir_archive('memo', {}, cached=True)
>>> b.keys()
[]
>>> b.archive.keys()
['a', 'c', 'b', 'e', 'd', 'g', 'f']
>>> b.load('a')
>>> b
dir_archive('memo', {'a': 0}, cached=True)
>>> b.load('b')
>>> b.load('f')
>>> b.load('g')
>>> b['g'](b['f'](b['a'],b['b']))
1
klepto also provides the same interface to a sql archive.
>>> print klepto.archives.sql_archive.__new__.__doc__
initialize a dictionary with a sql database archive backend
Connect to an existing database, or initialize a new database, at the
selected database url. For example, to use a sqlite database 'foo.db'
in the current directory, database='sqlite:///foo.db'. To use a mysql
database 'foo' on localhost, database='mysql://user:pass@localhost/foo'.
For postgresql, use database='postgresql://user:pass@localhost/foo'.
When connecting to sqlite, the default database is ':memory:'; otherwise,
the default database is 'defaultdb'. If sqlalchemy is not installed,
storable values are limited to strings, integers, floats, and other
basic objects. If sqlalchemy is installed, additional keyword options
can provide database configuration, such as connection pooling.
To use a mysql or postgresql database, sqlalchemy must be installed.
Inputs:
name: url for the sql database [default: (see note above)]
dict: initial dictionary to seed the archive
cached: if True, use an in-memory cache interface to the archive
serialized: if True, pickle table contents; otherwise cast as strings
>>> c = klepto.archives.sql_archive('database')
>>> c.update(b)
>>> c
sql_archive('sqlite:///database', {'a': 0, 'b': 1, 'g': <function <lambda> at 0x10446b1b8>, 'f': <built-in function max>}, cached=True)
>>> c.dump()
Now the same objects that are on disk are also in a sql archive. We can add new objects to either archive.
>>> b['x'] = 69
>>> c['y'] = 96
>>> b.dump('x')
>>> c.dump('y')
Get klepto here: https://github.com/uqfoundation