
Matrix inversion speed in Numpy

I have been reading the book "Hands-on Machine Learning with Scikit-Learn and TensorFlow". In Chapter 4 (page 110), it says: The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3) (depending on the implementation). In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 = 5.3 to 2^3 = 8.

I tested the above in Python:

import numpy as np
import timeit
nsize = 100
A = np.random.rand(nsize, nsize)
x = np.random.rand(nsize, 1)
b = A.dot(x)
# np.linalg.inv(A) or np.linalg.inv(A).dot(b) - not a big difference in time
timeit.timeit('np.linalg.inv(A)','import numpy as np\nfrom __main__ import A, b', number=10000)

I get about 2.5 seconds for nsize=100 and about 8.5 seconds for nsize=200. The compute cost seems to increase by a factor of 3.4 (8.5/2.5), which is well below the range mentioned by the author. Can someone explain what causes this "speedup"?
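
For reference, here is a minimal sketch of the same test wrapped in a function (the time_inv helper is not part of my original snippet), which also converts the measured factor into the implied exponent: a factor of about 3.4 per doubling of the size corresponds to an exponent of log2(3.4) ≈ 1.8.

import numpy as np
import timeit

# Hypothetical helper: average time of one np.linalg.inv call for an n x n matrix.
def time_inv(n, number=1000):
    A = np.random.rand(n, n)
    return timeit.timeit(lambda: np.linalg.inv(A), number=number) / number

t100 = time_inv(100)
t200 = time_inv(200)
factor = t200 / t100
# If the cost scales as n**p, then doubling n gives factor = 2**p, so p = log2(factor).
print(f"factor = {factor:.2f}, implied exponent = {np.log2(factor):.2f}")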

NumPy uses LAPACK's LU factorization for this (as pointed out by @Murali in the comments). On most platforms the LU factorization is performed by OpenBLAS (or by the Intel MKL, which behaves similarly). OpenBLAS uses dgemm calls (matrix multiplication) to make the LU factorization faster. dgemm is one of the most heavily optimized BLAS primitives for large matrices: it makes use of multiple threads and SIMD instructions so as to be very fast.

That being said, on small matrices, using multiple threads introduces a significant overhead (creating threads, distributing the work and waiting for the results takes time). This overhead tends to be negligible for bigger matrices. The same is true of SIMD instructions on even smaller matrices: SIMD instructions are fast at processing relatively large arrays, but they have a higher latency than scalar instructions (which can be better pipelined for small matrices). Loop unrolling and CPU caches cause the same effect for very small matrices. On top of that, NumPy performs some type checks and internal work (so as to optimize inner loops in other cases) that takes a few microseconds per call. The combination of all these factors makes the code quite inefficient for small matrices and faster for bigger ones, and this overhead biases the measured speed-up.
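
To see the threading overhead in isolation, here is a small sketch using the threadpoolctl package (an assumption on my side: it is a separate package that has to be installed, e.g. with pip) to cap the number of BLAS threads at runtime and compare timings on a size-100 matrix. On many machines the single-threaded run is not much slower at this size, or is even faster.

import timeit
import numpy as np
from threadpoolctl import threadpool_limits  # assumed installed: pip install threadpoolctl

A = np.random.rand(100, 100)

# Default: OpenBLAS/MKL may use all available cores.
t_multi = timeit.timeit(lambda: np.linalg.inv(A), number=10000)

# Cap the BLAS thread pool to a single thread for the duration of the block.
with threadpool_limits(limits=1, user_api="blas"):
    t_single = timeit.timeit(lambda: np.linalg.inv(A), number=10000)

print(f"all threads: {t_multi:.2f} s, one thread: {t_single:.2f} s")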

On my machine, for matrices of size 100, the dgemm calls take 19% of the time, dtrsm takes 11% and dlaswp 7% (the other calls take a negligible amount of time, and the rest is lost in waits or in the kernel). For matrices of size 200, dgemm takes 29%, dtrsm 12% and dlaswp 7%. For matrices of size 1000, dgemm takes 57%, dtrsm 5% and dlaswp 6%. As you can see, most of the time is wasted in overheads for matrices of size 100, while most of the time is spent in BLAS operations for matrices of size 1000.

Note that the LU factorization is hard to parallelize. This has been an active field of research for decades. Furthermore, the best parallelization methods (typically task-based ones) are unfortunately not used in libraries like OpenBLAS yet; AFAIK, the MKL uses them a bit. State-of-the-art implementations are faster, especially for small matrices, since dgemm is already well optimized for large ones.

Multithreading can be disabled in OpenBLAS by setting OMP_NUM_THREADS=1. With that, I got 151 µs for matrices of size 100 and 800 µs for matrices of size 200. One should subtract the ~4 µs of NumPy overhead, and there are other overheads that are not constant, such as page faults, allocations, the latency of SIMD instructions, cache effects, branch prediction, etc. That being said, the factor reaches 5.44 (excluding the NumPy overhead). For 200 vs 400 it is 5.67, and for 400 vs 800 it is even 6.8. The main reason the factor is still not close to 8 (besides the other overheads) is that OpenBLAS is not optimized to run the LU factorization with one thread: its internal parameters are set sub-optimally and the algorithm used is not optimal in this case.
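
Here is a minimal sketch of such a single-threaded measurement (the environment variable must be set before NumPy is imported; the sizes and repetition counts are arbitrary, and the exact factors depend on the machine and the BLAS build):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before importing NumPy/OpenBLAS

import timeit
import numpy as np

sizes = [100, 200, 400, 800]
times = []
for n in sizes:
    A = np.random.rand(n, n)
    # Average time of one inversion at this size.
    times.append(timeit.timeit(lambda: np.linalg.inv(A), number=100) / 100)

for small, big, t_small, t_big in zip(sizes, sizes[1:], times, times[1:]):
    print(f"{small} -> {big}: factor {t_big / t_small:.2f}")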
