
Python multiprocessing gives slower speed as more cores are used

I'm using the Python multiprocessing module to diagonalize a (large and sparse) matrix multiple times. I have to do this a thousand times, so I decided to do it with multiprocessing since I have 24 cores.

The code looks like:

import multiprocessing
from multiprocessing import Pool

import numpy as np
from scipy.sparse.linalg import eigsh
from scipy import sparse

def diag(param):
    wf, vf = 0, 0
    for i in range(10000):
        num = np.random.rand()
        .... # unrelated code producing the matrix with param and num

        Mat = sparse.csc_matrix((data, (row, col)), shape=(8000, 8000))
        w, v = eigsh(Mat, k=12)

        .... #some other unrelated process updating wf and vf using w and v


    return wf, vf

def Final(temp):
    print("Process " % multiprocessing.current_process().name)
    print(temp)
    np.random.seed()
    w0, v0 = diag(temp)
    
    .... #unrelated process using w0 and v0 
    
if __name__ == '__main__':
    with Pool() as p:
        print(p.map(Final, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])) 

The other parts of the code are irrelevant, as diagonalising an 8000 by 8000 sparse matrix is the rate-determining step in my case.

The code worked well when I did not use multiprocessing. However, with multiprocessing the speed is now roughly inversely proportional to the number of cores used. I'd be grateful for any input on this. I understand that Pool in general adds a (small) overhead, but not by this much. I'm very confused, as scipy does not implement multiprocessing internally.

Example: on a single core, 10 diagonalizations normally take around 2 s. With 24 cores (only 10 tasks in the example above), it takes ~40 s.

Edit: for context, the matrix is very sparse - only 48000 nonzero entries in an 8000 by 8000 matrix.


Edit: solved, but questions remain.

I've solved the issue, and it is quite interesting; I'd like your input.

The issue was as follows: when scipy.sparse diagonalises a matrix bigger than some threshold, it automatically multithreads (which I checked with top). However, this does NOT significantly increase the speed compared to the single-core case.
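If you want to confirm this without watching top, the third-party threadpoolctl package (not mentioned in the original post, so this is only a suggested check) can report how many threads the BLAS/OpenMP pools loaded by numpy/scipy are configured to use:

from threadpoolctl import threadpool_info  # pip install threadpoolctl

import numpy as np  # importing numpy/scipy loads the underlying BLAS/OpenMP libraries

# Each entry describes one native thread pool (MKL, OpenBLAS, OpenMP, ...)
for pool in threadpool_info():
    print(pool["internal_api"], "->", pool["num_threads"], "threads")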

I checked the performance on my own laptop (dual core, nothing fancy), and it was better than the 24-core output (which has a slightly higher flop count, but still). I realised that the automatic multithreading was not actually helping - it was just building up a queue and blocking multiprocessing.

Therefore, the solution was to force MKL and BLAS to a single thread using os.environ and then use multiprocessing - and the programme now runs very well. I'm still curious why BLAS multithreads in the first place but does not actually make use of the threads - it might be a developer issue, or there might be another convoluted explanation. Who knows!

I solved a similar issue by adding the following lines:

import os
os.environ["MKL_NUM_THREADS"] = "1" 
os.environ["NUMEXPR_NUM_THREADS"] = "1" 
os.environ["OMP_NUM_THREADS"] = "1" 
