
Anaconda MKL can't set number of threads

I was using NumPy from Anaconda to benchmark a large matrix multiplication (8192x8192, type float32) in Jupyter, like this:

import numpy as np
a = np.empty((8192, 8192), 'f')
%timeit a @ a

This NumPy is built against MKL. While the multiplication runs (continuously), I find the CPU usage of python is always 50%. I'm wondering why it isn't 100%, since matrix multiplication should be automatically parallelized. I therefore googled around and found two ways to set the number of threads MKL uses.

One way is directly using the DLL:

from ctypes import CDLL
mkl = CDLL('../conda/pkgs/mkl-2019.0-118/Library/bin/mkl_rt.dll')
print(mkl.MKL_Set_Num_Threads(4))
print(mkl.MKL_Get_Max_Threads())

which, I believe, gave me some unknown error code and failed to set the thread count:

-899695632
2

Another way is through the mkl-service package:

import mkl
print(mkl.set_num_threads(4))
print(mkl.get_max_threads())

which also didn't succeed:

None
2

I'm wondering why setting 4 threads in MKL keeps failing and how to make it work. I'm on 64-bit Windows 7. My CPU is an i5-2520M, which should have 4 cores. My Anaconda environment looks like this (abbreviated):

mkl                       2019.0                      118
mkl-service               1.1.2            py36hb217b18_5
mkl_fft                   1.0.6            py36hdbbee80_0
mkl_random                1.0.1            py36h77b88f5_1
numpy                     1.15.3           py36ha559c80_0
numpy-base                1.15.3           py36h8128ebf_0
zeromq                    4.2.5                he025d50_1

Please consider this documentation: https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading

The key variable is MKL_NUM_THREADS, which you can set as a system-wide Windows environment variable.
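As an alternative to a system-wide setting, the variable can also be set for the current process from Python. This is only a minimal sketch: it assumes the assignment runs before NumPy (and therefore MKL) is imported for the first time, since MKL is expected to read the variable when it initializes. The value 4 is simply the figure from the question.

```python
import os

# Assumption: this runs before the first `import numpy` in the process
# (in Jupyter, that means a freshly restarted kernel).
os.environ["MKL_NUM_THREADS"] = "4"

# Confirm what the process environment now holds.
print(os.environ["MKL_NUM_THREADS"])

# import numpy as np  # import NumPy only after the variable is set
```

Setting it this way affects only this Python process and its children, which makes it easy to try different values without touching the Windows settings.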

I strongly disagree with @roro on this. The reason you are seeing 50% is that you are not using your hyperthreading capabilities. That said, bear in mind that two factors limit the speed of calculation: CPU power and memory access bandwidth. Often the second limits throughput to, say, 70% of what the CPU could deliver, because RAM and cache cannot deliver data fast enough to the algorithm.
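The 50% figure can be checked with simple arithmetic. The sketch below assumes the i5-2520M exposes 2 physical cores as 4 logical processors through hyperthreading, which is consistent with the 2 reported by get_max_threads in the question. The function name expected_cpu_percent is made up for illustration.

```python
import os

def expected_cpu_percent(busy_threads, logical_cpus):
    """CPU usage a process shows when busy_threads threads are fully busy."""
    # A thread cannot occupy more than one logical processor at a time.
    return 100.0 * min(busy_threads, logical_cpus) / logical_cpus

# Assumed topology: 2 MKL threads (one per physical core), 4 logical processors.
print(expected_cpu_percent(2, 4))  # 50.0

# Number of logical processors on the machine running this snippet.
print(os.cpu_count())
```

Under that assumption, two fully busy threads account for exactly half of the logical processors, which matches the usage reported in the question.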

Getting parallelism right is among the more challenging parts of HPC.
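A note on the ctypes attempt in the question: in Intel's C headers, mkl_set_num_threads is declared as returning void, so the -899695632 that was printed is most likely not an error code. ctypes assumes an int return value by default and would simply read whatever was left in the return register. A minimal sketch, assuming the uppercase exports in mkl_rt follow those same C prototypes; declare_mkl_threading_prototypes is a hypothetical helper name, not part of MKL or ctypes:

```python
from ctypes import c_int

def declare_mkl_threading_prototypes(lib):
    """Attach C prototypes to the two MKL entry points used in the question."""
    # Assumed prototype: void MKL_Set_Num_Threads(int). Nothing is returned,
    # so restype is None and the call evaluates to None in Python.
    lib.MKL_Set_Num_Threads.argtypes = [c_int]
    lib.MKL_Set_Num_Threads.restype = None
    # Assumed prototype: int MKL_Get_Max_Threads(void). This one returns an int.
    lib.MKL_Get_Max_Threads.argtypes = []
    lib.MKL_Get_Max_Threads.restype = c_int
    return lib

# Usage (not run here; the DLL path is the one given in the question):
# from ctypes import CDLL
# mkl = declare_mkl_threading_prototypes(
#     CDLL('../conda/pkgs/mkl-2019.0-118/Library/bin/mkl_rt.dll'))
# mkl.MKL_Set_Num_Threads(4)
# print(mkl.MKL_Get_Max_Threads())
```

This only removes the misleading return value; it does not by itself change how many threads MKL decides to use.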
