
Shared-memory objects in multiprocessing

Suppose I have a large in-memory numpy array, and a function func that takes this giant array as input (together with some other parameters). func with different parameters can be run in parallel. For example:

from multiprocessing import Pool

def func(arr, param):
    # do stuff to arr, param
    ...

# build array arr

pool = Pool(processes=6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]

If I use the multiprocessing library, that giant array will be copied multiple times, once into each process.

Is there a way to let different processes share the same array? The array object is read-only and will never be modified.

What's more complicated: if arr is not an array but an arbitrary Python object, is there a way to share it?

[EDITED]

I read the answer but I am still a bit confused. Since fork() is copy-on-write, spawning new processes via Python's multiprocessing library should not incur any additional cost. But the following code suggests there is a huge overhead:

from multiprocessing import Pool
import numpy as np
import time

def f(arr):
    return len(arr)

t = time.time()
arr = np.arange(10000000)
print("construct array = ", time.time() - t)


pool = Pool(processes=6)

t = time.time()
res = pool.apply_async(f, [arr])
res.get()
print("multiprocessing overhead = ", time.time() - t)

Output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):

construct array =  0.0178790092468
multiprocessing overhead =  0.252444982529

Why is there such huge overhead if we didn't copy the array? And what part of the cost does shared memory save me?

If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don't alter the object).
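Note that the copy-on-write benefit applies only to data the children inherit at fork time; arguments passed through a Pool (as in the apply_async call above) are pickled and sent to the workers, which is likely the overhead measured in the edit. A minimal sketch of inheriting the array instead of passing it (the names arr and work are illustrative, and this relies on the fork start method, i.e. unix):

import numpy as np
from multiprocessing import Pool

arr = np.arange(10000000)  # created before the pool, so forked children inherit it

def work(param):
    # reference the inherited global instead of receiving arr as an argument,
    # so the big array is never pickled on task submission
    return arr[param]

if __name__ == '__main__':
    with Pool(processes=6) as pool:
        print(pool.map(work, range(4)))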

The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.
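A minimal sketch of that idea, assuming a flat array of doubles (the names shared, arr, and work are illustrative, not from the linked answer; lock=False is fine here because the array is read-only):

import ctypes
import numpy as np
from multiprocessing import Array, Pool

shared = Array(ctypes.c_double, 1000, lock=False)  # buffer in shared memory
arr = np.frombuffer(shared)                        # numpy view, no copy

def work(i):
    return arr[i] * 2  # forked children see the same underlying buffer

if __name__ == '__main__':
    arr[:] = np.arange(1000)
    with Pool(processes=4) as pool:
        print(pool.map(work, [0, 1, 2]))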

If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: shared memory (suitable for simple values, arrays, or ctypes) and a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).
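For instance, a minimal sketch of the shared-memory flavor with locking (a hypothetical shared counter, not from the answer; it relies on fork so the children inherit counter):

from multiprocessing import Pool, Value

counter = Value('i', 0)  # shared int, created with an associated lock

def bump(_):
    with counter.get_lock():  # serialize writers to avoid lost updates
        counter.value += 1

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        pool.map(bump, range(100))
    print(counter.value)  # 100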

The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.
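A minimal sketch of the Manager route for an arbitrary object (a plain dict here; the names are illustrative):

from multiprocessing import Manager, Pool

def lookup(args):
    shared, key = args
    return shared[key]  # each access is a round trip through the manager process

if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.dict({'a': 1, 'b': 2})
        with Pool(processes=2) as pool:
            print(pool.map(lookup, [(shared, 'a'), (shared, 'b')]))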

There are a wealth of parallel processing libraries and approaches available in Python. multiprocessing is an excellent and well-rounded library, but if you have special needs, one of the other approaches may be better.

I ran into the same problem and wrote a little shared-memory utility class to work around it.

I'm using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all (also lock-free), so be careful not to shoot yourself in the foot.

With this solution I get speedups by a factor of approximately 3 on a quad-core i7.

Here's the code. Feel free to use and improve it, and please report back any bugs.

'''
Created on 14.05.2013

@author: martin
'''

import multiprocessing
import ctypes
import numpy as np

class SharedNumpyMemManagerError(Exception):
    pass

'''
Singleton Pattern
'''
class SharedNumpyMemManager:    

    _initSize = 1024

    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            # object.__new__ takes no extra arguments in Python 3
            cls._instance = super(SharedNumpyMemManager, cls).__new__(cls)
        return cls._instance

    def __init__(self):
        self.lock = multiprocessing.Lock()
        self.cur = 0
        self.cnt = 0
        self.shared_arrays = [None] * SharedNumpyMemManager._initSize

    def __createArray(self, dimensions, ctype=ctypes.c_double):

        self.lock.acquire()

        # double size if necessary
        if (self.cnt >= len(self.shared_arrays)):
            self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)

        # next handle
        self.__getNextFreeHdl()        

        # create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, np.prod(dimensions))

        # convert to numpy array via ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)

        # reshape to the requested dimensions;
        # reshape returns a view on the original array, not a copy
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)

        # update cnt
        self.cnt += 1

        self.lock.release()

        # return handle to the shared memory numpy array
        return self.cur

    def __getNextFreeHdl(self):
        orgCur = self.cur
        while self.shared_arrays[self.cur] is not None:
            self.cur = (self.cur + 1) % len(self.shared_arrays)
            if orgCur == self.cur:
                raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')

    def __freeArray(self, hdl):
        self.lock.acquire()
        # set reference to None
        if self.shared_arrays[hdl] is not None: # consider multiple calls to free
            self.shared_arrays[hdl] = None
            self.cnt -= 1
        self.lock.release()

    def __getArray(self, i):
        return self.shared_arrays[i]

    @staticmethod
    def getInstance():
        if not SharedNumpyMemManager._instance:
            SharedNumpyMemManager._instance = SharedNumpyMemManager()
        return SharedNumpyMemManager._instance

    @staticmethod
    def createArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)

    @staticmethod
    def getArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)

    @staticmethod    
    def freeArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)

# Init Singleton on module load
SharedNumpyMemManager.getInstance()

if __name__ == '__main__':

    import timeit

    N_PROC = 8
    INNER_LOOP = 10000
    N = 1000

    def propagate(t):
        i, shm_hdl, evidence = t
        a = SharedNumpyMemManager.getArray(shm_hdl)
        for j in range(INNER_LOOP):
            a[i] = i

    class Parallel_Dummy_PF:

        def __init__(self, N):
            self.N = N
            self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)            
            self.pool = multiprocessing.Pool(processes=N_PROC)

        def update_par(self, evidence):
            self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))

        def update_seq(self, evidence):
            for i in range(self.N):
                propagate((i, self.arrayHdl, evidence))

        def getArray(self):
            return SharedNumpyMemManager.getArray(self.arrayHdl)

    def parallelExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_par(5)
        print(pf.getArray())

    def sequentialExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_seq(5)
        print(pf.getArray())

    t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
    t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")

    print("Sequential: ", t1.timeit(number=1))    
    print("Parallel: ", t2.timeit(number=1))

This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.

The code would look like the following.

import numpy as np
import ray

ray.init()

@ray.remote
def func(array, param):
    # Do stuff.
    return 1

array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)

result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)

If you don't call ray.put, then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.
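To make the difference concrete, a sketch based on that description (same func and array as above):

# passing the array itself stores a copy once per task
result_ids = [func.remote(array, i) for i in range(4)]

# storing it once with ray.put and passing the ID shares a single copy
array_id = ray.put(array)
result_ids = [func.remote(array_id, i) for i in range(4)]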

Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays, as below.

You can compare the performance of serialization in Ray versus pickle by running the following in IPython.

import numpy as np
import pickle
import ray

ray.init()

x = {i: np.ones(10**7) for i in range(20)}

# Time Ray.
%time x_id = ray.put(x)  # 2.4s
%time new_x = ray.get(x_id)  # 0.00073s

# Time pickle.
%time serialized = pickle.dumps(x)  # 2.6s
%time deserialized = pickle.loads(serialized)  # 1.9s

Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).

See the Ray documentation. You can read more about fast serialization using Ray and Arrow. Note that I'm one of the Ray developers.

As Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.

I made brain-plasma specifically for this reason: fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickled bytestrings generated by pickle.dumps(...).

The key difference from Apache Ray and Plasma is that it keeps track of object IDs for you. Any process, thread, or program running locally can share the variables' values by looking up the name through any Brain object.

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma

from brain_plasma import Brain
brain = Brain(path='/tmp/plasma/')

brain['a'] = [1]*10000

brain['a']
# >>> [1,1,1,1,...]
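Since the values live in the plasma store rather than in the process, a second process can read the same name by attaching its own Brain object to the same store; a minimal sketch assuming the plasma_store started above is still running:

# hypothetical second process
from brain_plasma import Brain

brain = Brain(path='/tmp/plasma/')
print(brain['a'][:5])  # [1, 1, 1, 1, 1], stored by the first process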

One simple way to make large objects (e.g. Pandas DataFrames or ndarrays) available to all processes during multiprocessing is to use the 'sharedmem' option with joblib. There is no serialization via pickling or other mechanisms, so it's very fast. If lock contention causes performance issues, you should be able to get around it by making local copies of the object in your processes (still much faster than serializing and deserializing).

The output below shows that all processes were able to access the shared pandas DataFrame (the 4 is the second value in each output tuple). The descending order of the first values verifies that the results list contains the values in the order in which the tasks were submitted, not the order in which they completed.

from joblib import Parallel, delayed
import time
import pandas as pd

my_shared_object = pd.DataFrame({'data':range(10000)})

def f(n):
    time.sleep(n / 10)
    return n, my_shared_object['data'][4]

result = Parallel(n_jobs=3, require='sharedmem')(delayed(f)(i) for i in range(10,1,-1))

[(10, 4), (9, 4), (8, 4), (7, 4), (6, 4), (5, 4), (4, 4), (3, 4), (2, 4)]
