Modular serialization with pickle (Python)

Original post: 2012-05-15 19:16:42. Tags: python, pickle

I want to serialise an object graph in a modular way; that is, I don't want to serialise the whole graph at once. The reason is that the graph is big. Splitting it up would let me keep timestamped versions of some parts of the graph, and use lazy access to postpone loading the parts I don't need right now.

I thought I could manage this with metaprogramming in Python, but it seems that metaprogramming is not strong enough in Python.

Here's what I do for now. My graph is composed of several different kinds of objects. Some of them are instances of a special class, which describes the root objects to be pickled. This is where the modularity comes in: each pickling starts from one of those instances, and I never pickle two of them in the same stream. Whenever the root object can reach a reference to another such instance, I replace that reference with a persistent_id, which ensures that no two of them end up in the same pickle stream.

The problem comes when unpickling a stream. I can find a persistent_id of an instance that is not loaded yet. When that happens, I have to wait for the target instance to be loaded before allowing access to it, and I don't see any way to do that:

1/ I tried to build an accessor whose get method returns the target of the reference. Unfortunately, accessors must be placed in the class declaration; I can't assign them to the unpickled object.
2/ I could store somewhere the places where references have to be resolved. I don't think this is possible in Python: one can't keep a reference to a place (a field or a variable), only a reference to a value.

My problem may not be clear; I'm still looking for a clear formulation. I tried other things, like using explicit references that would be instances of some "Reference" class. It isn't very convenient, though.

Do you have any idea how to implement modular serialisation with pickle? Would I have to change the internal behaviour of Unpickler so that it remembers the places where I need to load the rest of the object graph? Is there another library better suited to achieving similar results?

2 answers

Here's how I think I would go about this.

Have a module-level dictionary mapping persistent_id to SpecialClass objects. Every time you initialise or unpickle a SpecialClass instance, make sure that it is added to the dictionary.
Override SpecialClass's __getattr__ and __setattr__ methods, so that specialobj.foo = anotherspecialobj merely stores a persistent_id in a dictionary on specialobj (let's call it specialobj.specialrefs). When you retrieve specialobj.foo, it finds the name in specialrefs, then finds the referenced object in the module-level dictionary.
Have a module-level check_graph function which goes through the known SpecialClass instances and checks that all of their specialrefs are available.

Metaprogramming is strong in Python; Python classes are extremely malleable. You can alter them after declaration any way you want, though it's best done in a metaclass (or a decorator). More than that, instances are malleable, independently of their classes.
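As a small illustration of that malleability: a property only takes effect when it lives on the class, but it can be attached to the class long after the class statement, and instances that already exist pick it up. The names `Node`, `peer_id`, `nodes_by_id` and `add_accessor` below are made up for the example.

```python
class Node:
    """Hypothetical node class, declared without any accessor."""

    def __init__(self, peer_id):
        self.peer_id = peer_id


# Stand-in for "the nodes that have been loaded so far".
nodes_by_id = {}


def add_accessor(cls, name, id_field, table):
    """Attach a read-only accessor to an already declared class."""

    def getter(self):
        # Resolve the stored id at access time, not at load time.
        return table[getattr(self, id_field)]

    setattr(cls, name, property(getter))


n1 = Node("n2")                       # created before the accessor exists
add_accessor(Node, "peer", "peer_id", nodes_by_id)
n2 = Node("n1")
nodes_by_id.update(n1=n1, n2=n2)

# Instances are malleable too: plain attributes can be added one object at a time.
n1.note = "only on this instance"
```

With this in place, `n1.peer` looks up `"n2"` in `nodes_by_id` each time it is read, so the target only has to be present when the attribute is actually accessed.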

A 'reference to a place' is often simply a string. E.g. a reference to an object's field is the field's name. Assume you have multiple node references inside your node object. You could have something like {persistent_id: (object, field_name),..} as your unresolved-references table, which is easy to look up. Similarly, in lists of nodes, 'references to places' are indices.
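Here is one way such a table could be combined with pickle's `persistent_id` / `persistent_load` hooks. It is a sketch under several assumptions: `Node`, `Pending`, `dump_node` and `load_node` are invented names, references are only searched for in a node's own fields and in lists stored directly in those fields, and the table maps each id to a list of places (rather than a single tuple) because several places may wait for the same node.

```python
import io
import pickle
from collections import defaultdict


class Node:
    """Hypothetical root object; each one is pickled into its own stream."""

    def __init__(self, pid, **fields):
        self.pid = pid
        self.__dict__.update(fields)


class Pending:
    """Placeholder left where a not-yet-loaded node belongs."""

    def __init__(self, pid):
        self.pid = pid


class RootPickler(pickle.Pickler):
    def __init__(self, file, root):
        super().__init__(file)
        self.root = root

    def persistent_id(self, obj):
        # Every Node other than the root is written as its id only.
        if isinstance(obj, Node) and obj is not self.root:
            return obj.pid
        return None


class RootUnpickler(pickle.Unpickler):
    def __init__(self, file, loaded):
        super().__init__(file)
        self.loaded = loaded

    def persistent_load(self, pid):
        if pid in self.loaded:
            return self.loaded[pid]
        return Pending(pid)


loaded = {}                      # pid -> Node
unresolved = defaultdict(list)   # pid -> [(container, field name or list index), ...]


def dump_node(node):
    buf = io.BytesIO()
    RootPickler(buf, node).dump(node)
    return buf.getvalue()


def _places(node):
    # A "place" is (object, field_name) or (list, index).
    for name, value in vars(node).items():
        yield node, name, value
        if isinstance(value, list):
            for index, item in enumerate(value):
                yield value, index, item


def load_node(data):
    node = RootUnpickler(io.BytesIO(data), loaded).load()
    loaded[node.pid] = node
    # Remember every place in this node that still waits for another node.
    for container, key, value in _places(node):
        if isinstance(value, Pending):
            unresolved[value.pid].append((container, key))
    # Patch every place that was waiting for this node.
    for container, key in unresolved.pop(node.pid, []):
        if isinstance(container, list):
            container[key] = node
        else:
            setattr(container, key, node)
    return node
```

Until the target is loaded, the field holds a `Pending` object, so code that touches it early has to check for that; once `load_node` sees the target, the recorded places are overwritten with the real node.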

BTW, could you use a key-value database for graph storage? You'd be able to pull nodes by ID without waiting.
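The answer does not name a particular database. As one possible illustration, the standard library's `shelve` module (a persistent dictionary that pickles its values) can play that role; the record layout and the helper names `put_node`, `get_node` and `neighbours` are assumptions made for this sketch.

```python
import shelve


def put_node(db, node_id, data, refs):
    """Store one node under its id; references are kept as ids, not objects."""
    db[node_id] = {"data": data, "refs": list(refs)}


def get_node(db, node_id):
    """Fetch a single node record without touching the rest of the graph."""
    return db[node_id]


def neighbours(db, node_id):
    """Load the nodes referenced by node_id, one key lookup each."""
    return [db[ref] for ref in db[node_id]["refs"]]
```

A store opened with `shelve.open(path)` can be passed as `db`; each node is then pickled on its own, and a reference is followed only when `neighbours` (or a direct `get_node`) asks for it.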