
Saving dictionaries to file (numpy and Python 2/3 friendly)

I want to do hierarchical key-value storage in Python, which basically boils down to storing dictionaries to files. By that I mean any type of dictionary structure that may contain other dictionaries, numpy arrays, serializable Python objects, and so forth. Not only that, I want the numpy arrays to be stored space-efficiently, and everything to play nice between Python 2 and 3.
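
For concreteness, here is a made-up example of the kind of structure I mean:

import numpy as np

data = {
    'name': 'experiment-42',                # plain Python scalars and strings
    'params': {'alpha': 0.5, 'seed': 3},    # nested dictionary
    'samples': np.random.randn(1000, 10),   # numpy array that should be stored compactly
}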

Below are the methods I know are out there. My question is: what is missing from this list, and is there an alternative that dodges all my deal-breakers?

  • Python's pickle module (deal-breaker: inflates the size of numpy arrays a lot)
  • Numpy's save/savez/load (deal-breaker: incompatible format across Python 2/3)
  • PyTables replacement for numpy.savez (deal-breaker: only handles numpy arrays)
  • Using PyTables manually (deal-breaker: I want this for constantly changing research code, so it's really convenient to be able to dump dictionaries to files by calling a single function)

The PyTables replacement for numpy.savez is promising, since I like the idea of using hdf5, and it compresses the numpy arrays really efficiently, which is a big plus. However, it does not take any type of dictionary structure.
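
For reference, a minimal sketch of what compressed array storage looks like in PyTables (the filename and filter settings are placeholders):

import numpy as np
import tables

arr = np.random.randn(1000, 1000)
filters = tables.Filters(complevel=5, complib='zlib')  # zlib compression, level 5
with tables.open_file('data.h5', 'w') as f:
    f.create_carray('/', 'arr', obj=arr, filters=filters)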

Lately, what I've been doing is to use something similar to the PyTables replacement, but enhancing it to be able to store any type of entries. This actually works pretty well, but I find myself storing primitive data types in length-1 CArrays, which is a bit awkward (and ambiguous with actual length-1 arrays), even though I set the chunksize to 1 so it doesn't take up that much space.

Is there something like that already out there?

Thanks!

I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.

They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.

import tables
import cPickle  # used by the pickling fallback for unsupported types (not shown in this excerpt)
import warnings

def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each item in
    the dict must have a type and shape compatible with PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.remove_node(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True, warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries, otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB)
    """

    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print "Skipping item \"%s\"..." % child._v_pathname
                return False
        else:
            return True
    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    if isinstance(item, str):
                        if item == '_None':
                            item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
            pass
    return outdict
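
A minimal usage sketch, assuming the two functions above are defined (the filename and dict contents are placeholders):

import numpy as np
import tables

d = {'a': 1, 'b': np.arange(10), 'nested': {'c': None}}
f = tables.open_file('store.h5', 'w')
dict2group(f, f.root, 'mydict', d)
f.close()

f = tables.open_file('store.h5', 'r')
d2 = group2dict(f, f.root.mydict)
f.close()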

It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.
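
For illustration, a minimal sketch of the joblib route (the filename is a placeholder):

import numpy as np
import joblib

d = {'a': 1, 'arr': np.arange(1000)}
joblib.dump(d, 'store.joblib', compress=3)  # optional compression level
d2 = joblib.load('store.joblib')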

After asking this two years ago, I started coding my own HDF5-based replacement for pickle/np.save. Ever since, it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for.

I tried playing with np.memmap for saving an array of dictionaries. Say we have the dictionary:

import numpy as np

d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}
a = np.array([str(d)])

First I tried to save it directly to a memmap:

f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening, since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening for the same reason

The way that worked was to convert the dictionary to a string:

f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)

This works, and afterwards you can eval(f[0]) to get the value back (with the |S1000 dtype the entry comes back as bytes, so on Python 3 you would decode it first).
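
As a side note, ast.literal_eval is a safer substitute for eval here, since the stored string contains only Python literals:

import ast
import numpy as np

f = np.memmap('stack.array', dtype='|S1000', mode='r', shape=(100,))
d2 = ast.literal_eval(f[0].decode())  # parses literals only, refuses arbitrary code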

I do not know the advantage of this approach over the others, but it deserves a closer look.

I absolutely recommend a Python object database like ZODB. It seems pretty well suited for your situation, considering you store objects (literally whatever you like) in a dictionary; this means you can store dictionaries inside dictionaries. I've used it in a wide range of problems, and the nice thing is that you can just hand somebody the database file (the one with a .fs extension). With it, they'll be able to read it in, perform any queries they wish, and modify their own local copies. If you wish to have multiple programs simultaneously accessing the same database, make sure to look at ZEO.

Just a silly example of how to get started:

from ZODB import DB
from ZODB.FileStorage import FileStorage
from persistent.mapping import PersistentMapping
import transaction

# Defining database type and creating connection.
storage = FileStorage('/path/to/database/zodbname.fs') 
db = DB(storage)
connection = db.open()
root = connection.root()

# Define and populate the structure.
root['Vehicle'] = PersistentMapping() # Upper-most dictionary
root['Vehicle']['Tesla Model S'] = PersistentMapping() # Object 1 - also a dictionary
root['Vehicle']['Tesla Model S']['range'] = "208 miles"
root['Vehicle']['Tesla Model S']['acceleration'] = 5.9
root['Vehicle']['Tesla Model S']['base_price'] = "$71,070"
root['Vehicle']['Tesla Model S']['battery_options'] = ["60kWh","85kWh","85kWh Performance"]
# more attributes here

root['Vehicle']['Mercedes-Benz SLS AMG E-Cell'] = PersistentMapping() # Object 2 - also a dictionary
# more attributes here

# add as many objects with as many characteristics as you like.

# committing changes; up until this point things could still be rolled back with transaction.abort()
transaction.commit()
connection.close()
db.close()
storage.close()

Once the database is created, it's very easy to use. Since it's an object database (a dictionary), you can access objects very easily:

# after it's opened (lines from the very beginning, up to and including root = connection.root())
>>> root['Vehicle']['Tesla Model S']['range']
'208 miles'

You can also display all of the keys (and do all the other standard dictionary things you might want to do):

>>> root['Vehicle']['Tesla Model S'].keys()
['acceleration', 'range', 'battery_options', 'base_price']

The last thing I want to mention is that keys can be changed (see Changing the key value in python dictionary). Values can also be changed, so if your research results change because you changed your method or something, you don't have to rebuild the entire database from scratch (especially if everything else is still okay). Be careful with doing both of these. I put safety measures in my database code to make sure I'm aware of attempts to overwrite keys or values.
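
For illustration, one way such a safety measure might look (guard_set is a made-up helper, not part of ZODB):

def guard_set(mapping, key, value):
    # refuse to silently overwrite an existing key
    if key in mapping:
        raise KeyError('key %r is already set; delete it first to overwrite' % key)
    mapping[key] = value

guard_set(root['Vehicle']['Tesla Model S'], 'range', '265 miles')  # raises KeyError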

** ADDED **

# added imports
import numpy as np
from tempfile import TemporaryFile
outfile = TemporaryFile()

# insert into definition/population section
np.save(outfile, np.linspace(-1, 1, 10000))
root['Vehicle']['Tesla Model S']['arraydata'] = outfile

# check to see if it worked
>>> root['Vehicle']['Tesla Model S']['arraydata']
<open file '<fdopen>', mode 'w+b' at 0x2693db0>

outfile.seek(0)  # simulate closing and re-opening
A = np.load(root['Vehicle']['Tesla Model S']['arraydata'])

>>> A
array([-1.        , -0.99979998, -0.99959996, ...,  0.99959996,
    0.99979998,  1.        ])

You could also store multiple numpy arrays in this exact same way with numpy.savez(), or numpy.savez_compressed() if you want them compressed.
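
For instance, a small sketch of the savez_compressed variant (again writing into a temporary file; the array contents are placeholders):

import numpy as np
from tempfile import TemporaryFile

outfile = TemporaryFile()
np.savez_compressed(outfile, x=np.linspace(-1, 1, 10000), y=np.arange(100))
outfile.seek(0)  # simulate closing and re-opening
npz = np.load(outfile)
x, y = npz['x'], npz['y']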

This is not a direct answer. Anyway, you may also be interested in JSON. Have a look at 13.10. Serializing Datatypes Unsupported by JSON. It shows how to extend the format for unsupported types.
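
In that spirit, a minimal sketch of extending json for numpy arrays; the '__ndarray__' round-trip convention here is my own, not from the chapter:

import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        # serialize ndarrays as a tagged dict; fall back to the default otherwise
        if isinstance(obj, np.ndarray):
            return {'__ndarray__': obj.tolist(), 'dtype': str(obj.dtype)}
        return json.JSONEncoder.default(self, obj)

def as_ndarray(dct):
    # reverse the convention above on load
    if '__ndarray__' in dct:
        return np.array(dct['__ndarray__'], dtype=dct['dtype'])
    return dct

s = json.dumps({'a': 1, 'arr': np.arange(5)}, cls=NumpyEncoder)
d = json.loads(s, object_hook=as_ndarray)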

The whole chapter from "Dive Into Python 3" by Mark Pilgrim is definitely a good read, at the very least to know what is possible...

Update: Possibly an unrelated idea, but... I have read somewhere that one of the reasons XML was finally adopted for data exchange in heterogeneous environments was a study that compared a specialized binary format with zipped XML. The conclusion for you could be to use a possibly less space-efficient solution and compress it via zip or another well-known algorithm. Using a known algorithm helps when you need to debug (unzip and then inspect the text file by eye).
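
Along those lines, a tiny sketch of the compress-the-text-format idea, using gzip so the file can still be inspected with standard tools (the filename is a placeholder):

import gzip
import json

payload = json.dumps({'a': 1, 'b': [1, 2, 3]}).encode('utf-8')
with gzip.open('data.json.gz', 'wb') as f:
    f.write(payload)
with gzip.open('data.json.gz', 'rb') as f:
    d = json.loads(f.read().decode('utf-8'))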
