
How to share a list of multidimensional arrays between python processes?

I am trying to speed up my code by splitting the job among several python processes. In the single-threaded version of the code, I loop through code that accumulates results into several matrices of different dimensions. Since there is no data sharing between iterations, I can divide the task among several processes, each one having its own local set of matrices to accumulate results. When all the processes are done, I combine the matrices from all the processes.

My idea for solving the issue is to pass the same list of matrices to each process so that each process writes into them when it is done. My question is: how do I pass this list of numpy array matrices to the processes? This seems like a straightforward thing to do, except that it appears I can only pass a 1D array to the processes. A temporary solution would be to flatten all the numpy arrays and keep track of where each one begins and ends, but is there a way to simply pass a list of the matrices to the processes?
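
For concreteness, such a flatten-and-track-offsets workaround might look like the following sketch (the variable names here are illustrative): several matrices are packed into one flat buffer, and zero-copy views are sliced back out using (offset, shape) bookkeeping.

import numpy as np

matrices = [np.zeros((2, 3)), np.zeros((4, 4)), np.zeros((1, 5))]
shapes = [m.shape for m in matrices]
sizes = [m.size for m in matrices]
offsets = np.concatenate(([0], np.cumsum(sizes)))     # start index of each matrix
flat = np.concatenate([m.ravel() for m in matrices])  # one shareable 1D buffer

# Recover each matrix as a zero-copy view into the flat buffer:
views = [flat[offsets[k]:offsets[k + 1]].reshape(shapes[k])
         for k in range(len(matrices))]
views[1][2, 2] = 7.0                        # writes through to `flat`
assert flat[offsets[1] + 2 * 4 + 2] == 7.0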

For 1D arrays, previous answers show how to do that with shared memory; see this post, for example. For multidimensional arrays, a similar approach can be used, since reshaping an array does not copy its contents. You just need the shape (and possibly the strides) of the array to reshape it and operate on the reshaped array. Thus, you send the buffer and the shape to the processes, and each process converts the buffer back into a multidimensional Numpy array.
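
A minimal sketch of that approach, assuming Python >= 3.8 and the standard multiprocessing.shared_memory module (the function and variable names here are illustrative): the parent allocates a shared block, and each worker re-creates a zero-copy numpy view of it from the block's name plus the array's shape and dtype.

import numpy as np
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory

def worker(shm_name, shape, dtype):
    # Attach to the existing shared block and view it as an array (no copy).
    shm = SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr += 1  # mutate the shared data in place
    shm.close()

if __name__ == '__main__':
    data = np.zeros((3, 4), dtype=np.float64)
    shm = SharedMemory(create=True, size=data.nbytes)
    arr = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    arr[:] = data  # copy the initial data into shared memory
    p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()
    print(arr)  # the worker's in-place changes are visible in the parent
    shm.close()
    shm.unlink()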

Here is a solution that does not require Python >= 3.8 and just uses multiprocessing.Array. The idea is to use such a shared array as the backing store for a numpy array.

In this example, each process in the pool initializes a global variable np_array, so we do not have to explicitly pass the shared array to each worker function, and the worker functions do not have to concern themselves with re-creating a numpy array from the shared array. Moreover, this re-creation only has to be done N times, where N is the pool size, rather than M times, where M is the number of tasks submitted to the pool. If you find global variables anathema, the alternative is to explicitly pass the shared array to each worker at process-creation time and have the worker re-create the numpy array from it (a sketch of this variant follows the example's output below).

import numpy as np
from multiprocessing import Array, Pool

def np_array_from_shared_array(shared_array, shape, is_locked_array=True):
    shared_array_obj = shared_array.get_obj() if is_locked_array else shared_array
    return np.frombuffer(shared_array_obj, dtype=np.float64).reshape(shape)

def init_pool_processes(shared_array, shape, is_locked_array):
    """
    Init each pool process.
    The numpy array is created from the passed shared array and a global
    variable is initialized with a reference to it.
    """
    global np_array
    np_array = np_array_from_shared_array(shared_array, shape, is_locked_array)

def change_array(i, j):
    np_array[i, j] += 100

if __name__ == '__main__':
    data = np.array([[1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7], [7.7, 6.6, 5.5, 4.4, 3.3, 2.2, 1.1]])
    shape = data.shape
    # Specify lock=True if multiple processes will be updating the same
    # array element.
    # Each task here updates a unique element, so no locking is required:
    NEEDS_LOCKING = False
    shared_array = Array('d', shape[0] * shape[1], lock=NEEDS_LOCKING)
    # Wrap the shared array in a numpy array so we can easily manipulate its data.
    np_array = np_array_from_shared_array(shared_array, shape, NEEDS_LOCKING)
    # Copy data to our shared array.
    np.copyto(np_array, data)

    # Before
    print(np_array)

    # Init each process in the pool with shared_array:
    pool = Pool(initializer=init_pool_processes, initargs=(shared_array, shape, NEEDS_LOCKING))
    result = pool.starmap(change_array, ((i, j) for i in range(shape[0]) for j in range(shape[1])))
    pool.close()
    pool.join()

    # After:
    print(np_array)

Prints:

[[1.1 2.2 3.3 4.4 5.5 6.6 7.7]
 [7.7 6.6 5.5 4.4 3.3 2.2 1.1]]
[[101.1 102.2 103.3 104.4 105.5 106.6 107.7]
 [107.7 106.6 105.5 104.4 103.3 102.2 101.1]]
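
For completeness, here is a minimal sketch of the explicit-passing variant mentioned above (names are illustrative). A shared ctypes array can only be handed to a child at process-creation time (through Process args or a pool initializer), not through Pool task arguments, so this sketch uses Process directly; each worker re-creates its numpy view from the passed buffer and shape.

import numpy as np
from multiprocessing import Array, Process

def np_array_from_raw(shared_array, shape):
    # shared_array was created with lock=False, so it is a raw ctypes array.
    return np.frombuffer(shared_array, dtype=np.float64).reshape(shape)

def worker(shared_array, shape, row):
    arr = np_array_from_raw(shared_array, shape)
    arr[row, :] += 100  # each process updates its own row; no lock needed

if __name__ == '__main__':
    data = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
    shared_array = Array('d', data.size, lock=False)
    np_array = np_array_from_raw(shared_array, data.shape)
    np.copyto(np_array, data)

    processes = [Process(target=worker, args=(shared_array, data.shape, r))
                 for r in range(data.shape[0])]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(np_array)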
