Overridden __setitem__ call works in serial but breaks in apply_async call
I've been fighting with this problem for some time now, and I've finally managed to narrow it down and create a minimal working example.

The summary of the problem is that I have a class that inherits from dict to facilitate parsing of miscellaneous input files. I've overridden the __setitem__ call to support recursive indexing of sections in our input file (e.g. parser['some.section.variable'] is equivalent to parser['some']['section']['variable']). This has been working great for us for over a year now, but we just ran into an issue when passing these Parser objects through a multiprocessing apply_async call.

Shown below is the minimal working example. Obviously the __setitem__ call isn't doing anything special, but it's important that it accesses some attribute like self.section_delimiter; this is where it breaks. It doesn't break in the initial call or in the serial function call. But when you call some_function (which doesn't do anything either) using apply_async, it crashes.

import multiprocessing as mp
import numpy as np

class Parser(dict):

    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)

def some_function(parser):
    pass

if __name__ == "__main__":
    print("Initialize creation/setting")
    parser = Parser()
    parser['x'] = 1

    print("Single serial call works fine")
    some_function(parser)

    print("Parallel async call breaks on line 16?")
    pool = mp.Pool(1)
    for i in range(1):
        pool.apply_async(some_function, (parser,))
    pool.close()
    pool.join()

If you run the code above, you'll get the following output:

Initialize creation/setting
    __init__
    __setitem__
Single serial call works fine
Parallel async call breaks on line 16?
    __setitem__
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
    return _ForkingPickler.loads(res)
  File "test_apply_async.py", line 13, in __setitem__
    self.section_delimiter
AttributeError: 'Parser' object has no attribute 'section_delimiter'

Any help is greatly appreciated. I spent considerable time tracking down this bug and reproducing a minimal example. I would love to not only fix it, but also fill the gap in my understanding of how apply_async and inheritance/overridden methods interact.

Let me know if you need any more information.

Thank you very much!

Isaac

The cause of the problem is that multiprocessing serializes and deserializes your Parser object to move its data across process boundaries. This is done using pickle. By default, pickle does not call __init__() when deserializing an object. Because of this, self.section_delimiter is not yet set when the deserializer calls __setitem__() to restore the items in your dictionary, and you get the error:

AttributeError: 'Parser' object has no attribute 'section_delimiter'

Using just pickle and no multiprocessing gives the same error:

import pickle
parser = Parser()
parser['x'] = 1
data = pickle.dumps(parser)
copy = pickle.loads(data) # Same AttributeError here

Deserialization does work for an object with no items, and the value of section_delimiter is restored:

import pickle
parser = Parser()
parser.section_delimiter = "|"
data = pickle.dumps(parser)
copy = pickle.loads(data)
print(copy.section_delimiter) # Prints "|"

So in a sense you are just unlucky that pickle calls __setitem__() before it restores the rest of the state of your Parser.
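
To make that ordering visible, here is a minimal sketch that records what pickle does while loading a dict subclass. The Traced class and the events list are hypothetical names used only for this illustration; they are not part of your code.

```python
import pickle

events = []  # hypothetical log of the calls pickle makes during loads()

class Traced(dict):
    def __init__(self):
        events.append("__init__")
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        # Record whether the instance attribute exists at this point.
        has_attr = hasattr(self, "section_delimiter")
        events.append(("__setitem__", key, has_attr))
        dict.__setitem__(self, key, value)

    def __setstate__(self, state):
        # pickle calls this when it restores the instance __dict__.
        events.append("__setstate__")
        self.__dict__.update(state)

original = Traced()
original["x"] = 1
events.clear()  # only keep what happens during the round trip

restored = pickle.loads(pickle.dumps(original))
print(events)
```

On the round trip, __init__ does not appear in the log at all, and the __setitem__ entry comes before __setstate__ with the attribute still missing.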

You can work around this by setting section_delimiter in __new__() and telling pickle what arguments to pass to __new__() by implementing __getnewargs__():

    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        return (self.section_delimiter,)

__getnewargs__() returns a tuple of arguments, which pickle passes to __new__() when it recreates the object. Because section_delimiter is set in __new__(), it is no longer necessary to set it in __init__().

This is the code of your Parser class after the change:

class Parser(dict):

    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()

    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        return (self.section_delimiter,)

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)
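
As a quick check of this workaround, the following sketch round-trips a trimmed-down copy of the class through pickle with a non-default delimiter. The prints and the file_name parameter are left out here only to keep the example short; that trimming is my assumption, not a required change.

```python
import pickle

class Parser(dict):
    def __new__(cls, *args):
        self = super().__new__(cls)
        # Set the attribute here so it exists before pickle restores items.
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        # pickle passes this tuple to __new__() when recreating the object.
        return (self.section_delimiter,)

    def __setitem__(self, key, value):
        self.section_delimiter  # would raise AttributeError if unset
        dict.__setitem__(self, key, value)

parser = Parser()
parser.section_delimiter = "|"
parser["x"] = 1

restored = pickle.loads(pickle.dumps(parser))
print(restored, restored.section_delimiter)
```

One thing to watch: Python passes the constructor's positional arguments to __new__() as well as to __init__(), so with this signature a call like Parser('input.txt') would store the file name as the delimiter. If you construct Parser with a file name, __new__() needs to tell the two apart.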

The reason pickle calls __setitem__() on your Parser object is that it is a dictionary. If your Parser is just a class that happens to implement __setitem__() and __getitem__() and uses an internal dictionary to implement those calls, then pickle will not call __setitem__() and serialization will work with no extra code:

class Parser:

    def __init__(self, file_name : str = None):
        print('\t__init__')
        self.dict = { }
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        self.dict[key] = value

    def __getitem__(self, key):
        return self.dict[key]

So if there is no other reason for your Parser to be a dictionary, I would just not use inheritance here.
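
For completeness, here is a rough sketch of how the composition approach could look with the dotted-key indexing from the question. The splitting logic in __setitem__() and __getitem__() and the data attribute name are my guesses at what your real class does, so treat them as hypothetical.

```python
import pickle

class Parser:
    def __init__(self, section_delimiter="."):
        self.data = {}  # plain dict holding the nested sections
        self.section_delimiter = section_delimiter

    def __setitem__(self, key, value):
        # Hypothetical dotted-key handling: "a.b" stores value in data["a"]["b"].
        *sections, leaf = key.split(self.section_delimiter)
        node = self.data
        for name in sections:
            node = node.setdefault(name, {})
        node[leaf] = value

    def __getitem__(self, key):
        # Walk down one nested dict per section name.
        node = self.data
        for name in key.split(self.section_delimiter):
            node = node[name]
        return node

parser = Parser()
parser["some.section.variable"] = 42

# pickle restores the instance attributes directly, without calling __setitem__().
restored = pickle.loads(pickle.dumps(parser))
print(restored["some.section.variable"])
print(restored["some"]["section"]["variable"])
```

Because the object is not a dict subclass, pickle only has the instance attributes to restore, so the ordering problem does not come up.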