
ProcessPoolExecutor overhead?? Parallel processing takes more time than a single process for a large matrix operation

My Python code contains a numpy dot operation on a huge matrix and vector (sizes over 2^(tens...)). To reduce the computing time, I applied parallel processing by dividing the matrix according to the number of CPU cores. I used concurrent.futures.ProcessPoolExecutor. My issue is that the parallel processing takes much more time than the single process.

The following is my code.

1. Single-process code:

```python
self._vector = np.dot(matrix, self._vector)
```

2. Parallel-processing code:

```python
each_worker_coverage = int(self._dimension / self.workers)
args = []
for i in range(self.workers):
    if (i + 1) * each_worker_coverage < self._dimension:
        arg = [i, gate[i * each_worker_coverage:(i + 1) * each_worker_coverage], self._vector]
    else:
        arg = [i, gate[i * each_worker_coverage:self._dimension], self._vector]
    args.append(arg)
pool = futures.ProcessPoolExecutor(self.workers)
results = list(pool.map(innerproduct, args, chunksize=1))
for result in results:
    if (result[0] + 1) * each_worker_coverage < self._dimension:
        self._vector[result[0] * each_worker_coverage:(result[0] + 1) * each_worker_coverage] = result[1]
    else:
        self._vector[result[0] * each_worker_coverage:self._dimension] = result[1]
```

The `innerproduct` function called in parallel is as follows.

```python
def innerproduct(args):
    answer = np.dot(args[1], args[2])
    return args[0], answer
```

For a 2^14 x 2^14 matrix and a 2^14 vector, the single process code takes only 0.05 seconds, but the parallel processing code takes 6.2 seconds.
I also checked the time with the `innerproduct` method, and it only takes 0.02~0.03 sec.
I don't understand this situation.
Why does the parallel processing (multi-processing not multi-threading) take more time?



To know exactly what causes the slowdown you would have to measure how long everything takes, and with multiprocessing and multithreading that can be tricky.

So what follows is my best guess. For multiprocessing to work, the parent process has to transfer the data used in the calculations to the worker processes. The time this takes depends on the amount of data. Transferring a 2^14 x 2^14 matrix is probably going to take a significant amount of time. I suspect that this data transfer is what is taking the extra time.
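A back-of-the-envelope check of that guess (the numbers below are computed from the sizes in the question, not measured on the asker's machine): `pool.map` pickles every argument list, and for float64 data the pickled payload is essentially the raw bytes of the arrays.

```python
import pickle

import numpy as np

# A 2^14 x 2^14 float64 matrix alone is 2 GiB that has to be serialized
# and piped to the workers, one slice per task.
matrix_bytes = (2**14) ** 2 * 8
print(matrix_bytes / 2**30, "GiB")   # 2.0 GiB

# A tiny stand-in slice shows that the pickled payload is at least as
# large as the raw array bytes it carries.
slice_ = np.zeros((4, 2**14))
payload = pickle.dumps([0, slice_, np.zeros(2**14)])
print(len(payload) >= slice_.nbytes)
```

Pushing on the order of gigabytes through pipes can easily cost seconds, which is consistent with the 0.05 s vs 6.2 s gap reported above.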

If you are using an operating system that uses the fork start method for multiprocessing / concurrent.futures, there is a way around this data transfer. Such operating systems include Linux, *BSD and macOS (but not ms-windows).

On the abovementioned operating systems, multiprocessing uses the fork system call to create its workers. This system call creates a copy of the parent process as the child process. So if you create the vectors and matrices before creating the ProcessPoolExecutor, the workers will inherit that data. This is not a very costly or time-consuming operation because all these OSes use copy-on-write to manage memory pages. As long as the original matrix isn't changed, all processes using it are reading from the same memory pages. Inheriting the data means you don't have to pass it explicitly to the workers. You only have to pass a small data structure that says which index range a worker has to operate on.
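A minimal sketch of that fork-and-inherit idea (POSIX only; `DIM`, `partial_dot` and `parallel_dot` are illustrative names, not from the question's code):

```python
import multiprocessing
import numpy as np
from concurrent import futures

# Create the data BEFORE the executor, at module level, so forked workers
# inherit it via copy-on-write instead of receiving it through a pipe.
DIM = 2**10
matrix = np.random.rand(DIM, DIM)
vector = np.random.rand(DIM)

def partial_dot(bounds):
    lo, hi = bounds
    # Reads the inherited pages; only two integers were pickled per task.
    return lo, np.dot(matrix[lo:hi], vector)

def parallel_dot(workers=4):
    step = DIM // workers
    ranges = [(i * step, DIM if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    out = np.empty(DIM)
    ctx = multiprocessing.get_context("fork")   # explicit: fork start method
    with futures.ProcessPoolExecutor(workers, mp_context=ctx) as pool:
        for lo, part in pool.map(partial_dot, ranges):
            out[lo:lo + len(part)] = part
    return out
```

`parallel_dot()` should agree with `matrix @ vector`; note that each task's argument is just a pair of indices.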

Unfortunately, due to technical limitations of the platform, this doesn't work on ms-windows. What you could do on ms-windows is store the original matrix and vector in memory-mapped binary files before you create the Executor. If you tag these mappings with a name, the worker processes should be able to map the same data into their memory without having to transfer it. I think it is possible to instruct numpy to use such a raw binary array without recreating it.
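One way to sketch the numpy side of that idea (the file name is illustrative): the parent dumps the raw bytes once, and each worker re-opens the same file with `np.memmap`, so the OS shares the pages instead of the data being pickled through a pipe.

```python
import os
import tempfile

import numpy as np

dim = 2**8
matrix = np.random.rand(dim, dim)

# Parent side: write the raw array bytes to a file once.
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
matrix.tofile(path)

# Worker side: map the same file read-only; no copy, no pickling.
view = np.memmap(path, dtype=np.float64, mode="r", shape=(dim, dim))
assert np.array_equal(view, matrix)
```

The dtype and shape are not stored in the raw file, so both sides must agree on them out of band (or you can use `np.save`/`np.load(..., mmap_mode="r")`, which records them in the file header).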

On both platforms you could use the same technique to "send data back" to the parent process: save the data in shared memory and return the filename or tag name to the parent process.
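A sketch of that "return only a name" pattern using the standard library's `multiprocessing.shared_memory` (Python 3.8+); both halves are shown in one process for brevity:

```python
import numpy as np
from multiprocessing import shared_memory

part = np.arange(8, dtype=np.float64)        # stand-in for a worker's result

# Worker side: write the partial result into a named shared-memory block
# and return only shm.name to the parent.
shm = shared_memory.SharedMemory(create=True, size=part.nbytes)
np.ndarray(part.shape, dtype=part.dtype, buffer=shm.buf)[:] = part

# Parent side: attach by the returned name and copy the result out.
peer = shared_memory.SharedMemory(name=shm.name)
result = np.ndarray(part.shape, dtype=part.dtype, buffer=peer.buf).copy()
peer.close()
shm.close()
shm.unlink()   # free the block once the parent has copied the data
```

Only the short name string travels through the result pipe; the array bytes never get pickled.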

If you are using modern versions of NumPy and your OS, it is most likely that

```python
self._vector = np.dot(matrix, self._vector)
```

is already optimized and uses all your CPU cores.

If `np.show_config()` displays openblas or MKL, you may run a simple test:

```python
a = np.random.rand(7000, 7000)
b = np.random.rand(7000, 7000)
np.dot(a, b)
```

It should use all CPU cores for a couple of seconds.

If it does not, you may install OpenBLAS or MKL and reinstall NumPy. See *Using MKL to boost Numpy performance on Ubuntu*.

