
pickle, dill and cloudpickle returning field as empty dict on custom class after process termination

I have an object of a custom class that I am trying to serialize and permanently store.

When I serialize it, store it, load it and use it in the same run, it works fine. It only messes up when I've ended the process and then try to load it again from the pickle file. This is the code that works fine:

first_model = NgramModel(3, name="debug")

for paragraph in text:
    first_model.train(paragraph_to_sentences(paragraph))
    # paragraph to sentences just uses regex to do the equivalent of splitting by punctuation

print(first_model.context_options)
# context_options is a dict (counter)

first_model = NgramModel.load_existing_model("debug")
#load_existing_model loads the pickle file. Look in the class code

print(first_model.context_options)

However, when I run this alone, it prints an empty counter:

first_model = NgramModel.load_existing_model("debug")

print(first_model.context_options)

This is a shortened version of the class file (the only two methods that touch pickle/dill are update_pickle_state and load_existing_model):

import os
import dill
from itertools import count
from collections import Counter
from os import path


class NgramModel:
    context_options: dict[tuple, set[str]] = {}
    ngram_count: Counter[tuple] = Counter()
    n = 0
    pickle_path: str = None
    num_paragraphs = 0
    num_sentences = 0

    def __init__(self, n: int, **kwargs):
        self.n = n
        self.pickle_path = NgramModel.pathify(kwargs.get('name', NgramModel.gen_pickle_name())) #use name if exists else generate random name

    def train(self, paragraph_as_list: list[str]):
        '''Really the central method that coordinates everything else. Takes a list of sentences, generates data (n-grams) from each, updates the fields, and saves the instance (self) to a pickle file.'''
        self.num_paragraphs += 1
        for sentence in paragraph_as_list:
            self.num_sentences += 1
            generated = self.generate_Ngrams(sentence)
            self.ngram_count.update(generated)
            for ngram in generated:
                self.add_to_set(ngram)
        self.update_pickle_state()

    def update_pickle_state(self):
        '''saves instance to pickle file'''
        # the with-block closes the file even if dill.dump raises
        with open(self.pickle_path, "wb") as file:
            dill.dump(self, file)

    @staticmethod
    def load_existing_model(name: str):
        '''returns object from pickle file'''
        path = NgramModel.pathify(name)
        # the with-block closes the file once loading is done
        with open(path, "rb") as file:
            obj: NgramModel = dill.load(file)
        return obj

    def generate_Ngrams(self, string: str):
        '''ref: https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python/'''
        words = string.split(" ")
        words = ["<start>"] * (self.n - 1) + words + ["<end>"] * (self.n - 1)

        list_of_tup = []

        for i in range(len(words) + 1 - self.n):
            list_of_tup.append((tuple(words[i + j] for j in range(self.n - 1)), words[i + self.n - 1]))

        return list_of_tup

    def add_to_set(self, ngram: tuple[tuple[str, ...], str]):
        if ngram[0] not in self.context_options:
            self.context_options[ngram[0]] = set()
        self.context_options[ngram[0]].add(ngram[1])

    @staticmethod
    def pathify(name):
        '''converts name to path'''
        return f"models/{name}.pickle"

    @staticmethod
    def gen_pickle_name():
        for i in count():
            new_name = f"unnamed-pickle-{i}"
            if not path.exists(NgramModel.pathify(new_name)):
                return new_name

All the other fields print properly and are complete and correct, except the two dicts.

The problem is that context_options is a mutable class member, not an instance member. If I had to guess, dill is only pickling instance members, since the class definition holds class members. That would account for why you see a "filled-out" context_options when you're working in the same shell but not when you load fresh: in the former case you're using the dirtied class member.
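To make that guess concrete, here is a minimal sketch. It uses the standard-library pickle rather than dill (on the assumption that both store the same instance state for a class that lives in an importable module), and a hypothetical class named Leaky that stands in for NgramModel. It also suggests why the integer counters survive: `+=` on an int rebinds the name on the instance, while the dict and Counter are only mutated in place on the class.

```python
import pickle
from collections import Counter


class Leaky:
    # Class-level attributes: one shared object for every instance.
    context_options: dict = {}
    ngram_count: Counter = Counter()
    num_sentences = 0

    def train(self, key, value):
        # `+=` on an int rebinds the name, which creates an *instance* attribute.
        self.num_sentences += 1
        # These calls mutate the shared class-level objects in place;
        # no instance attribute is ever created for them.
        self.context_options.setdefault(key, set()).add(value)
        self.ngram_count.update([key])


model = Leaky()
model.train(("foo", "bar"), "baz")

# The instance __dict__ is the state that gets pickled for a plain object.
instance_state = vars(model)
print(instance_state)  # {'num_sentences': 1}

restored = pickle.loads(pickle.dumps(model))
restored_state = vars(restored)

# In the same process the class-level dict is still populated, so attribute
# lookup on the restored object falls through to it and looks "filled out".
shared_in_same_process = restored.context_options is Leaky.context_options
print(shared_in_same_process)  # True
```

In a fresh process the class is re-created from its definition with an empty dict and an empty Counter, and the pickle file has nothing to put in their place.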

It's for stuff like this that you generally don't want to use mutable class members (or similarly, mutable default values in function signatures). More typical is to use something like context_options: dict[tuple, set[str]] = None and then check if it's None in the __init__ to set it to a default value, e.g., an empty dict. Alternatively, you could use a @dataclass and provide a field initializer, i.e.:

import dataclasses

@dataclasses.dataclass
class NgramModel:
    # default_factory builds a fresh dict for every instance
    context_options: dict[tuple, set[str]] = dataclasses.field(default_factory=dict)
    ...
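For the plain __init__ route, a minimal sketch might look like the following. FixedModel is a hypothetical, trimmed-down stand-in for NgramModel, and the round trip uses the standard-library pickle in place of dill, on the assumption that dill.dump and dill.load treat instance attributes the same way.

```python
import pickle
from collections import Counter


class FixedModel:
    def __init__(self, n: int):
        self.n = n
        # Fresh containers per instance, stored in the instance __dict__,
        # so they are part of the state that gets pickled.
        self.context_options: dict[tuple, set[str]] = {}
        self.ngram_count: Counter = Counter()

    def add_to_set(self, ngram):
        context, word = ngram
        self.context_options.setdefault(context, set()).add(word)
        self.ngram_count[ngram] += 1


first = FixedModel(3)
first.add_to_set((("foo", "bar"), "baz"))

# A second instance starts empty instead of seeing first's data.
second = FixedModel(3)

# The round-tripped copy carries its own populated dict.
restored = pickle.loads(pickle.dumps(first))
```

Pickle files written before such a change would still lack the two containers, since they were never part of the saved instance state, so the model would need to be retrained and saved again.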

You can observe what I mean about it being a mutable class member with, for instance:

if __name__ == '__main__':
    ng = NgramModel(3, name="debug")
    print(ng.context_options)   # {}
    ng.context_options[("foo", "bar")] = {"baz", "qux"}
    print(ng.context_options)   # {('foo', 'bar'): {'baz', 'qux'}}
    ng2 = NgramModel(3, name="debug")
    print(ng2.context_options)  # {('foo', 'bar'): {'baz', 'qux'}}

I would expect a brand new ng2 to have the same context that the brand new ng had: empty (or whatever an appropriate default is).
