Pickling an object containing a large numpy array

I'm pickling an object which has the following structure:

obj
  |---metadata
  |---large numpy array

I'd like to be able to access the metadata. However, if I pickle.load() each object while iterating over a directory (say, because I'm looking for some specific metadata to determine which one to return), then it gets lengthy. I'm guessing pickle wants to load, well, the whole object.

Is there a way to access only the top-level metadata of the object without having to load the whole thing?

I thought about maintaining an index, but then it means I have to implement the logic of it and keep it current, which I'd rather avoid if there's a simpler solution...

Yes, ordinary pickle will load everything. In Python 3.8, the new pickle protocol 5 (PEP 574) allows one to control how objects are serialized and to use a side channel for the large part of the data, but that is mainly useful when using pickle in inter-process communication. It would require a custom implementation of the pickling for your objects.
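
For illustration, a minimal sketch of what that side channel looks like, assuming Python 3.8+ and a numpy recent enough to support out-of-band buffers. Note that this avoids copying the array into the pickle stream, but it does not by itself give you lazy loading:

import pickle
import numpy as np

big = np.zeros(10_000_000)
buffers = []
# The large buffer is handed to buffer_callback instead of being
# embedded in the pickle stream:
payload = pickle.dumps(big, protocol=5, buffer_callback=buffers.append)
# The same buffers must be supplied again when loading:
restored = pickle.loads(payload, buffers=buffers)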

However, even with older Python versions it is possible to customize how to serialize your objects to disk.

For example, instead of having your arrays as ordinary members of your objects, you could have them "living" in another data structure - say, a dictionary - and implement data access to your arrays indirectly, through that dictionary.

In Python versions prior to 3.8, this will require you to "cheat" on the pickle customization, in the sense that, upon serialization of your object, the custom method should save the separate data as a side effect. But other than that, it should be straightforward.

In more concrete terms, when you have something like:


import numpy as np
from typing import Any

class MyObject:
    def __init__(self, data: np.ndarray, meta_data: Any):
        self.data = data
        self.meta_data = meta_data

Augment it this way - you should still be able to do whatever you do with your objects, but pickling will now only pickle the metadata - the numpy arrays will "live" in a separate data structure that won't be automatically serialized:


import numpy as np
from typing import Any
from uuid import uuid4

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        # Store only a key on the instance; the real value lives in VAULT:
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]

class MyObject:

    data = SeparateSerializationDescriptor()

    def __init__(self, data: np.ndarray, meta_data: Any):
        self.data = data
        self.meta_data = meta_data

Really - that is all that is needed to customize the attribute access: all ordinary uses of the self.data attribute will retrieve the original numpy array seamlessly - self.data[0:10] will just work. But pickle, at this point, will retrieve the contents of the instance's __dict__ - which only contains a key to the real data in the "vault" object.
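
For example, a quick check of that behaviour (the array size and metadata here are just for illustration):

import pickle

obj = MyObject(np.arange(1_000_000), meta_data={"label": "example"})
print(obj.data[0:10])        # transparent access to the array held in VAULT
payload = pickle.dumps(obj)
print(len(payload))          # tiny: the stream holds the metadata and the
                             # uuid key, not the one-million-element array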

Besides allowing you to serialize the metadata and data to separate files, it also allows you fine-grained control of the data in memory, by manipulating the "VAULT".
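
Continuing the snippet above, for instance, popping an entry from VAULT evicts a large array from memory while the object and its metadata stay alive (with this plain descriptor the array is then simply gone; with the lazy-loading mixin below it would be re-read from disk on the next access):

VAULT.pop(obj.__dict__["data"])   # evict the big array from memory;
                                  # obj.meta_data remains fully usable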

And now, customize the pickling of the class so that it will save the data separately to disk, and retrieve it on reading. On Python 3.8, this probably can be done "within the rules" (since I am answering this, I will take the time to take a look at that). For traditional pickle, we "break the rules", in that we save the extra data to disk, and load it, as side effects of serialization.

Actually, it just occurred to me that customizing the methods used directly by the pickle protocol, like __reduce_ex__ and __setstate__, while it would work, would, again, automatically unpickle the whole object from disk.

A way to go is: upon serialization, save the full data in a separate file, and create some more metadata so that the array file can be found. Upon deserialization, always load only the metadata - and build into the descriptor above a mechanism to lazily load the arrays as needed.

So, we provide a Mixin class, whose dump method should be called instead of pickle.dump - so the data is written to separate files. To unpickle the object, use Python's pickle.load normally: it will retrieve only the "normal" attributes of the object. The object's .load() method can then be called explicitly to load all the arrays, or it will be called automatically, in a lazy way, when the data is first accessed:

import pathlib
from uuid import uuid4
import pickle

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        try:
            return VAULT[instance.__dict__[self.name]]
        except KeyError:
            # attempt to silently load missing data from disk upon first array access after unpickling:
            instance.load()
            return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]


class SeparateSerializationMixin:

    def _iter_descriptors(self, data_dir):
        # Find all separately-serialized attributes on this class:
        for attr in self.__class__.__dict__.values():
            if not isinstance(attr, SeparateSerializationDescriptor):
                continue
            id = self.__dict__[attr.name]
            if not data_dir:
                # use the absolute path saved at dump time instead of a passed-in folder
                # (the path entry may be missing if the object was never dumped)
                saved = self.__dict__.get(attr.name + "_path")
                data_path = pathlib.Path(saved) if saved else None
            else:
                data_path = data_dir / (id + ".pickle")
            yield attr, id, data_path

    def dump(self, file, protocol=None, **kwargs):
        data_dir = pathlib.Path(file.name).absolute().parent

        # Annotate paths and pickle all numpy arrays into separate files:
        for attr, id, data_path in self._iter_descriptors(data_dir):
            self.__dict__[attr.name + "_path"] = str(data_path)
            with data_path.open("wb") as data_file:
                pickle.dump(getattr(self, attr.name), data_file, protocol=protocol)

        # Pickle the metadata as originally intended:
        pickle.dump(self, file, protocol, **kwargs)


    def load(self, data_dir=None):
        """Load all saved arrays associated with this object.

        If data_dir is not passed, the absolute path saved at dump time is used.
        Otherwise the files are looked up by their name in the given folder.
        """
        if data_dir:
            data_dir = pathlib.Path(data_dir)

        for attr, id, data_path in self._iter_descriptors(data_dir):
            with data_path.open("rb") as data_file:
                VAULT[id] = pickle.load(data_file)

    def __del__(self):

        for attr, id, path in self._iter_descriptors(None):
            VAULT.pop(id, None)
        try:
            super().__del__()
        except AttributeError:
            pass

class MyObject(SeparateSerializationMixin):

    data = SeparateSerializationDescriptor()

    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data
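
For illustration, a hypothetical round trip with the classes above (the file name is just an example, and this assumes the definitions from the previous block are in scope):

obj = MyObject(np.arange(1_000_000), {"label": "example"})
with open("obj.pickle", "wb") as f:
    obj.dump(f)             # also writes one "<uuid>.pickle" file per array

# later, possibly in another process:
with open("obj.pickle", "rb") as f:
    obj2 = pickle.load(f)   # fast - only the metadata is read here
print(obj2.meta_data)       # no array access, nothing loaded from disk yet
print(obj2.data[0:10])      # first access triggers the lazy load()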

Of course this is not perfect, and there are likely corner cases. I included some safeguards in case the data files are moved to another directory - but I did not test that.

Other than that, using these in an interactive session here went smoothly, and I could create a MyObject instance that would be pickled separately from its data attribute, which then would be loaded just when needed on unpickling.

As for the suggestion of just "keeping stuff in a database" - some of the code here can be used just as well with your objects if they live in a database, and you prefer to keep the raw data on the filesystem rather than in a "blob column" in the database.
