
Sharing array of objects with Python multiprocessing

For this question, I refer to the example in the Python docs discussing the "use of the SharedMemory class with NumPy arrays, accessing the same numpy.ndarray from two distinct Python shells".

A major change that I'd like to implement is to manipulate an array of class objects rather than integer values, as I demonstrate below.

import numpy as np
from multiprocessing import shared_memory    

# a simplistic class example
class A(): 
    def __init__(self, x): 
        self.x = x

# numpy array of class objects 
a = np.array([A(1), A(2), A(3)])       

# create a shared memory instance
shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='psm_test0')

# numpy array backed by shared memory
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)                                    

# copy the original data into shared memory
b[:] = a[:]                                  

print(b)                                            

# array([<__main__.A object at 0x7fac56cd1190>,
#        <__main__.A object at 0x7fac56cd1970>,
#        <__main__.A object at 0x7fac56cd19a0>], dtype=object)

Now, in a different shell, we attach to the shared memory space and try to manipulate the contents of the array.

import numpy as np
from multiprocessing import shared_memory

# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')

c = np.ndarray((3,), dtype=object, buffer=existing_shm.buf)

Even before we are able to manipulate c, printing it will result in a segmentation fault: an object array only stores pointers to Python objects living in the creating process's heap, and those addresses are meaningless in another process. Indeed, I cannot expect to observe behaviour that has not been written into the module, so my question is: what can I do to work with a shared array of objects?

I'm currently pickling the list, but protected reads/writes add a fair bit of overhead. I've also tried using Namespace, which was quite slow because indexed writes are not allowed. Another idea could be to use a ctypes Structure in a ShareableList, but I wouldn't know where to start with that.
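As a starting point for the ShareableList idea, here is one possible sketch. It sidesteps ctypes entirely and simply stores each object's pickle as a bytes item in a ShareableList; the three-element list and the payloads are made up for illustration. Note that a ShareableList slot's size is fixed at creation time, so later writes must not exceed the original byte length:

```python
import pickle
from multiprocessing import shared_memory

# a simplistic class example, as in the question
class A:
    def __init__(self, x):
        self.x = x

# each slot holds one pickled object; slot sizes are fixed at creation
sl = shared_memory.ShareableList([pickle.dumps(A(i)) for i in range(3)])

first = pickle.loads(sl[0])      # read: unpickle a copy of the object
print(first.x)                   # 0

sl[2] = pickle.dumps(A(99))      # write: a same-size payload fits the slot
updated = pickle.loads(sl[2])
print(updated.x)                 # 99

sl.shm.close()
sl.shm.unlink()
```

Another process could attach with `shared_memory.ShareableList(name=...)` using the name from `sl.shm.name`. Like the numpy-of-bytes approach below, this still pays the pickle/unpickle copy on every access.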

In addition there is also a design aspect: it appears that there is an open bug in shared_memory that may affect my implementation, wherein I have several processes working on different elements of the array.

Is there a more scalable way of sharing a large list of objects between several processes so that at any given time all running processes interact with a unique object/element in the list?

UPDATE: At this point, I will also accept partial answers that talk about whether this can be achieved with Python at all.

So, I did a bit of research (Shared Memory Objects in Multiprocessing) and came up with a few ideas:

Pass a numpy array of bytes

Serialize the objects, then save them as byte strings in a numpy array. The problems here are that:

  1. One needs to pass the data type from the creator of 'psm_test0' to any consumer of 'psm_test0'. This could be done with another shared memory, though.

  2. pickle and unpickle are essentially like deepcopy, i.e. they actually copy the underlying data.

The code for the 'main' process reads:

import pickle
from multiprocessing import shared_memory
import numpy as np


# a simplistic class example
class A():
    def __init__(self, x):
        self.x = x

    def pickle(self):
        return pickle.dumps(self)

    @classmethod
    def unpickle(cls, bts):
        return pickle.loads(bts)


if __name__ == '__main__':
    # Test pickling procedure
    a = A(1)
    print(A.unpickle(a.pickle()).x)
    # >>> 1

    # numpy array of byte strings
    a_arr = np.array([A(1).pickle(), A(2).pickle(), A('This is a really long test string which should exceed 42 bytes').pickle()])

    # create a shared memory instance
    shm = shared_memory.SharedMemory(
        create=True,
        size=a_arr.nbytes,
        name='psm_test0'
    )

    # numpy array backed by shared memory
    b_arr = np.ndarray(a_arr.shape, dtype=a_arr.dtype, buffer=shm.buf)

    # copy the original data into shared memory
    b_arr[:] = a_arr[:]

    print(b_arr.dtype)
    # 'S105'

and for the consumer:

import numpy as np
from multiprocessing import shared_memory
from test import A  # assumes the 'main' script above is saved as test.py

# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')

c = np.ndarray((3,), dtype='S105', buffer=existing_shm.buf)

# Test data transfer
arr = [a.x for a in list(map(A.unpickle, c))]
print(arr)
# [1, 2, ...]
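Point 1 above says the dtype could be passed via another shared memory; a minimal sketch of that idea follows. The block name 'psm_test0_meta' is made up for the example, and the rstrip guards against the allocation being rounded up to a page size on some platforms:

```python
import numpy as np
from multiprocessing import shared_memory

# creator side: publish the dtype string in a tiny second block
dtype_str = b'S105'
meta = shared_memory.SharedMemory(
    create=True, size=len(dtype_str), name='psm_test0_meta')
meta.buf[:len(dtype_str)] = dtype_str

# consumer side: recover the dtype instead of hard-coding 'S105'
meta_r = shared_memory.SharedMemory(name='psm_test0_meta')
recovered = np.dtype(bytes(meta_r.buf).rstrip(b'\x00').decode())
print(recovered)

meta_r.close()
meta.close()
meta.unlink()
```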

I'd say you have a few ways of going forward:

  1. Stay with simple data types.

  2. Implement something using the C API, but I can't really help you there.

  3. Use Rust.

  4. Use Managers. You may lose out on some performance (I'd like to see a real benchmark though), but you can get a relatively safe and simple interface for shared objects.

  5. Use Redis, which also has Python bindings...
