
Python multiprocessing gives slower speed as more cores are used

I'm using the Python multiprocessing module to diagonalize a (large and sparse) matrix multiple times. I have to do this a thousand times, so I decided to do it with multiprocessing, as I have 24 cores.

The code looks like:

import multiprocessing
from multiprocessing import Pool

import numpy as np
from scipy.sparse.linalg import eigsh
from scipy import sparse

def diag(param):
    wf, vf = 0, 0
    for i in range(10000):
        num = np.random.rand()
        .... # unrelated code producing the matrix with param and num

        Mat = sparse.csc_matrix((data, (row, col)), shape=(8000, 8000))
        w, v = eigsh(Mat, k=12)

        .... #some other unrelated process updating wf and vf using w and v


    return wf, vf

def Final(temp):
    print("Process " % multiprocessing.current_process().name)
    print(temp)
    np.random.seed()
    w0, v0 = diag(temp)
    
    .... #unrelated process using w0 and v0 
    
if __name__ == '__main__':
    with Pool() as p:
        print(p.map(Final, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])) 

Other parts of the code are irrelevant, as diagonalising an 8000 by 8000 sparse matrix is the rate-determining step in my case.

The process worked well when I did not use multiprocessing. However, the speed with multiprocessing is now inversely proportional to the number of cores used. I'll be grateful for any input on this. As I understand it, a pool does in general make your process a tiny bit slower, but not by this much. I'm very confused, as scipy inherently does not have multiprocessing implemented.

Example: normally, on one core, 10 diagonalizations take around 2 s. On 24 cores (10 in the example above), it takes ~40 s.

Edit: As context, the matrix is very sparse - only 48000 entries in an 8000 by 8000 matrix.
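For reference, a minimal, self-contained sketch of that timing experiment; the random matrix below is only a stand-in with roughly the same size and density as the real one (the actual matrix construction is elided in the question):

import time

from scipy import sparse
from scipy.sparse.linalg import eigsh

# Hypothetical stand-in: ~48000 nonzeros in an 8000 x 8000 matrix, symmetrised
# so that eigsh (which expects a Hermitian matrix) can be applied to it.
density = 48000 / (8000 * 8000)
Mat = sparse.random(8000, 8000, density=density, format='csc', random_state=0)
Mat = Mat + Mat.T

start = time.time()
for _ in range(10):
    w, v = eigsh(Mat, k=12)
print("10 diagonalizations took %.1f s" % (time.time() - start))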


Edit: Solved, but with remaining questions.

I've solved the issue; it is very interesting and I want your input.

The issue was as follows: when scipy.sparse diagonalises a matrix bigger than some threshold, it automatically multithreads (which I checked with top). However, this does NOT increase the speed significantly compared to the single-core case.

I checked the performance on my own laptop (dual core, nothing fancy), and it was better than the 24-core output (which has a slightly larger number of FLOPS, but still), and realised that the automatic multithreading was not achieving anything - it was just building up the queue and blocking multiprocessing.

Therefore, the solution was to set MKL and BLAS to a single thread using os and then use multiprocessing - and the programme runs very well now. I'm now curious why BLAS multithreads in the first place but does not actually utilise the extra threads at all - it might be a developer issue, but there might be another convoluted explanation. Who knows!
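As a side note, if changing environment variables globally is undesirable, the threadpoolctl package (an extra dependency, not part of the fix described above) can cap the BLAS/OpenMP thread pools just around the call, so that each Pool worker runs eigsh single-threaded; a minimal sketch:

from threadpoolctl import threadpool_limits
from scipy.sparse.linalg import eigsh

def diag_single_threaded(Mat, k=12):
    # Limit every BLAS / OpenMP thread pool to one thread for this call only,
    # so Pool workers do not oversubscribe the machine's cores.
    with threadpool_limits(limits=1):
        return eigsh(Mat, k=k)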

I solved a similar issue by adding the following lines:

import os
os.environ["MKL_NUM_THREADS"] = "1" 
os.environ["NUMEXPR_NUM_THREADS"] = "1" 
os.environ["OMP_NUM_THREADS"] = "1" 
