
Overridden __setitem__ call works in serial but breaks in apply_async call

I've been fighting with this problem for some time now and I've finally managed to narrow down the issue and create a minimum working example.

The summary of the problem is that I have a class that inherits from a dict to facilitate parsing of miscellaneous input files. I've overridden the __setitem__ call to support recursive indexing of sections in our input file (e.g. parser['some.section.variable'] is equivalent to parser['some']['section']['variable']). This has been working great for us for over a year now, but we just ran into an issue when passing these Parser classes through a multiprocessing.apply_async call.
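(For context, here is a minimal sketch of what that dotted-key indexing might look like. Our real Parser implementation is larger, so the class below is a hypothetical stand-in, not the actual code:)

```python
class SectionDict(dict):
    """Hypothetical sketch of dotted-key indexing; not the real Parser."""
    section_delimiter = "."

    def __setitem__(self, key, value):
        head, _, rest = key.partition(self.section_delimiter)
        if rest:
            # Create the intermediate section on demand, then recurse.
            child = dict.setdefault(self, head, SectionDict())
            child[rest] = value
        else:
            dict.__setitem__(self, key, value)

    def __getitem__(self, key):
        head, _, rest = key.partition(self.section_delimiter)
        node = dict.__getitem__(self, head)
        return node[rest] if rest else node

d = SectionDict()
d['some.section.variable'] = 1
print(d['some']['section']['variable'])  # 1
```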

Shown below is the minimum working example - obviously the __setitem__ call isn't doing anything special, but it's important that it accesses some class attribute like self.section_delimiter - this is where it breaks. It doesn't break in the initial call or in the serial function call. But when you call some_function (which doesn't do anything either) using apply_async, it crashes.

import multiprocessing as mp
import numpy as np

class Parser(dict):

    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()
        self.section_delimiter = "."
    
    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter  # attribute access - this is where it breaks
        dict.__setitem__(self, key, value)
           
def some_function(parser):
    pass

if __name__ == "__main__":

    print("Initialize creation/setting")
    parser = Parser()
    parser['x'] = 1

    print("Single serial call works fine")
    some_function(parser)

    print("Parallel async call breaks on line 16?")
    pool = mp.Pool(1)
    for i in range(1):
        pool.apply_async(some_function, (parser,))

    pool.close()
    pool.join()

If you run the code above, you'll get the following output:

Initialize creation/setting
    __init__
    __setitem__
Single serial call works fine
Parallel async call breaks on line 16?
    __setitem__
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
    return _ForkingPickler.loads(res)
  File "test_apply_async.py", line 13, in __setitem__
    self.section_delimiter
AttributeError: 'Parser' object has no attribute 'section_delimiter'

Any help is greatly appreciated. I spent considerable time tracking down this bug and reproducing a minimal example. I would love to not only fix it, but also fill some gap in my understanding of how apply_async and inherited/overridden methods interact.

Let me know if you need any more information.

Thank you very much!

Isaac

Cause

The cause of the problem is that multiprocessing serializes and deserializes your Parser object to move its data across process boundaries. This is done using pickle. By default, pickle does not call __init__() when deserializing classes. Because of this, self.section_delimiter is not set when the deserializer calls __setitem__() to restore the items in your dictionary, and you get the error:

AttributeError: 'Parser' object has no attribute 'section_delimiter'

Using just pickle and no multiprocessing gives the same error:

import pickle

parser = Parser()
parser['x'] = 1

data = pickle.dumps(parser)
copy = pickle.loads(data) # Same AttributeError here

Deserialization will work for an object with no items, and the value of section_delimiter will be restored:

import pickle

parser = Parser()
parser.section_delimiter = "|"

data = pickle.dumps(parser)
copy = pickle.loads(data)

print(copy.section_delimiter) # Prints "|"

So in a sense you are just unlucky that pickle calls __setitem__() before it restores the rest of the state of your Parser.
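You can see this ordering directly: for a dict subclass, the default __reduce_ex__() returns a 5-tuple (callable, args, state, listitems, dictitems), and the unpickler replays dictitems through __setitem__() before it applies state (the instance __dict__). A small sketch, with the Parser from the question repeated (minus the prints) so it is self-contained:

```python
class Parser(dict):
    def __init__(self, file_name=None):
        super().__init__()
        self.section_delimiter = "."

p = Parser()
p['x'] = 1

# Default pickling of a dict subclass (protocol 2+) produces the tuple
# (callable, args, state, listitems, dictitems). The unpickler feeds
# dictitems through __setitem__() first and only then applies state,
# so section_delimiter does not exist yet when __setitem__() runs.
func, args, state, listitems, dictitems = p.__reduce_ex__(2)
items = list(dictitems)
print(state)  # {'section_delimiter': '.'}
print(items)  # [('x', 1)]
```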

Workaround

You can work around this by setting section_delimiter in __new__() and telling pickle what arguments to pass to __new__() by implementing __getnewargs__():

def __new__(cls, *args):
    self = super(Parser, cls).__new__(cls)
    self.section_delimiter = args[0] if args else "."
    return self

def __getnewargs__(self):
    return (self.section_delimiter,)

__getnewargs__() returns a tuple of arguments. Because section_delimiter is set in __new__(), it is no longer necessary to set it in __init__().

This is the code of your Parser class after the change:

class Parser(dict):

    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()

    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        return (self.section_delimiter,)
 
    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)
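With __new__() and __getnewargs__() in place, a pickle round trip of a populated Parser succeeds. A quick check (prints omitted from the class for brevity):

```python
import pickle

class Parser(dict):
    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        return (self.section_delimiter,)

    def __setitem__(self, key, value):
        self.section_delimiter  # set by __new__(), even during unpickling
        dict.__setitem__(self, key, value)

parser = Parser()
parser['x'] = 1

copy = pickle.loads(pickle.dumps(parser))  # no AttributeError anymore
print(copy['x'])               # 1
print(copy.section_delimiter)  # .
```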

Simpler solution

The reason pickle calls __setitem__() on your Parser object is that it is a dictionary. If your Parser is just a class that happens to implement __setitem__() and __getitem__() and has a dictionary to implement those calls, then pickle will not call __setitem__() and serialization will work with no extra code:

class Parser:

    def __init__(self, file_name : str = None):
        print('\t__init__')
        self.dict = { }
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        self.dict[key] = value

    def __getitem__(self, key):
        return self.dict[key]

So if there is no other reason for your Parser to be a dictionary, I would just not use inheritance here.
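A quick round-trip check of this composition version (no __new__()/__getnewargs__() needed, since a plain object is restored by setting its __dict__ rather than by replaying items through __setitem__()):

```python
import pickle

class Parser:
    def __init__(self, file_name=None):
        self.dict = {}
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        self.section_delimiter  # attribute access from the original example
        self.dict[key] = value

    def __getitem__(self, key):
        return self.dict[key]

parser = Parser()
parser['x'] = 1

# Unpickling restores the instance __dict__ directly; __setitem__()
# is never called, so no extra pickle support code is required.
copy = pickle.loads(pickle.dumps(parser))
print(copy['x'])  # 1
```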
