Strange performance results for numpy matrix multiplication

Question

Recently I've discovered a case in which matrix multiplication with numpy shows very strange performance (at least to me). To illustrate it, I've created an example of such matrices and a simple script to demonstrate the timings. Both can be downloaded from the repo; I don't include the script here because it's of little use without the data.

The script multiplies two pairs of matrices (the pairs have the same shape and dtype; only the data differs) in different ways, using both the dot function and einsum. I've noticed several anomalies:

The first pair (A * B) is multiplied much faster than the second one (C * D).
When I convert all matrices to float64, the times become the same for both pairs: longer than it took to multiply A * B, but shorter than C * D.
These effects persist with both einsum (a numpy implementation, as I understand it) and dot (which uses BLAS on my machine).

For the sake of completeness, here is the output of this script on my laptop:

With np.dot:
A * B: 0.142910003662 s
C * D: 4.9057161808 s
A * D: 0.20524597168 s
C * B: 4.20220398903 s
A * B (to float32): 0.156805992126 s
C * D (to float32): 5.11792707443 s
A * B (to float64): 0.52608704567 s
C * D (to float64): 0.484733819962 s
A * B (to float64 to float32): 0.255760908127 s
C * D (to float64 to float32): 4.7677090168 s

With einsum:
A * B: 0.489732980728 s
C * D: 7.34477996826 s
A * D: 0.449800014496 s
C * B: 4.05954909325 s
A * B (to float32): 0.411967992783 s
C * D (to float32): 7.32073783875 s
A * B (to float64): 0.80580997467 s
C * D (to float64): 0.808521032333 s
A * B (to float64 to float32): 0.414498090744 s
C * D (to float64 to float32): 7.32472801208 s

How can these results be explained, and how can I multiply C * D as fast as A * B?

Answer 1

The slowdown you're seeing is due to calculations involving subnormal numbers. Many processors are much slower when performing arithmetic operations with subnormal inputs or outputs. There are a couple of related StackOverflow questions: see this related C# question (in particular the answer from Eric Postpischil), and this answer to a C++ question for more information.

In your specific case, the matrix C (with a dtype of float32) contains several subnormal numbers. For single-precision floats, the subnormal/normal boundary is 2^-126, or around 1.18e-38. Here's what I see for C:

>>> ((0 < abs(C)) & (abs(C) < 2.0**-126)).sum()  # number of subnormal entries
44694
>>> C.size
682450

So around 6.5% of C's entries are subnormal, which is more than enough to slow down the C*B and C*D multiplications. In contrast, A and B don't go near the subnormal boundary:

>>> abs(A[A != 0]).min()
4.6801152e-12
>>> abs(B[B != 0]).min()
4.0640174e-07

So none of the intermediate values involved in the A*B matrix multiplication is subnormal, and no speed penalty applies.
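The check above can be wrapped in a small helper. The following is a minimal sketch, not the answerer's code: the name `count_subnormals` and the `demo` array are hypothetical stand-ins (the real matrices live in the question's repo). It also shows that float32 subnormals become ordinary normal numbers once cast to float64, and subnormal again when cast back, which appears consistent with the float64 and "float64 to float32" timings reported in the question.

```python
import numpy as np

def count_subnormals(arr):
    """Count nonzero entries whose magnitude is below the smallest
    positive normal value of arr's floating-point dtype."""
    tiny = np.finfo(arr.dtype).tiny  # 2**-126 for float32, 2**-1022 for float64
    mag = np.abs(arr)
    return int(np.count_nonzero((mag > 0) & (mag < tiny)))

# Hypothetical stand-in data: two float32 subnormals among normal values.
demo = np.array([0.0, 1.0, 1e-39, -3e-42, 1e-30], dtype=np.float32)

n32 = count_subnormals(demo)                     # counted against the float32 boundary
n64 = count_subnormals(demo.astype(np.float64))  # counted against the float64 boundary
n_back = count_subnormals(demo.astype(np.float64).astype(np.float32))

print(n32, n64, n_back)
```

Using `np.finfo(arr.dtype).tiny` rather than a hard-coded `2.0**-126` keeps the same helper usable for float64 inputs.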

As to the second part of your question, I'm not sure what to suggest. If you try hard enough, and you're using x64/SSE2 (rather than the x87 FPU), you can set the flush-to-zero and denormals-are-zero flags from Python. See this answer for a crude and non-portable ctypes-based hack; if you really want to follow this route, writing a custom C extension to do this might be a better bet.

I'd be tempted instead to try scaling C to bring it entirely into the normal range (and to bring the individual products from C*D into the normal range, too), but that might not be possible if C also has values at the upper extremes of the floating-point range. Alternatively, simply replacing the tiny values in C with zeros might work, but whether the resulting accuracy loss is significant and/or acceptable would depend on your application.
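Both ideas can be sketched in a few lines. This is an illustration under assumptions rather than a recommended implementation: `flush_subnormals`, `scaled_dot`, and the default `exponent=64` are hypothetical names and values, and a suitable exponent depends on the actual range of the data (too large a value overflows the big entries, too small leaves products subnormal).

```python
import numpy as np

def flush_subnormals(arr):
    """Return a copy of arr with every subnormal entry replaced by zero."""
    tiny = np.finfo(arr.dtype).tiny
    out = arr.copy()
    out[np.abs(out) < tiny] = 0  # also matches exact zeros, which stay zero
    return out

def scaled_dot(c, d, exponent=64):
    """Multiply c by 2**exponent before the matrix product, then undo it.

    Scaling by a power of two does not round as long as nothing overflows.
    The exponent is an assumed value, chosen only for this illustration.
    """
    scale = c.dtype.type(2.0 ** exponent)
    return (c * scale).dot(d) / scale

# Hypothetical stand-in data, not the matrices from the question.
c = np.array([[1e-39, 2.0], [0.5, -3e-42]], dtype=np.float32)
d = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

flushed = flush_subnormals(c)
product = scaled_dot(c, d)
```

Flushing discards information, whereas scaling preserves it only while the scaled values stay below the overflow threshold, so which one is acceptable depends on the application.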

Answer 2

Mark Dickinson already answered your question, but just for fun, try this:

Cp = np.array(list(C[:,0]))
Ap = np.array(list(A[:,0]))

This removes the slicing overhead and makes sure that the two arrays are laid out the same way in memory.

%timeit Cp * Cp   # 34.9 us per loop
%timeit Ap * Ap   # 3.59 us per loop

Whoops.
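For readers without the original data, a comparable experiment can be run on synthetic arrays. This is a rough sketch: the array length, the fill values, and the helper name `time_square` are all assumptions, and the size of any gap depends on the CPU and on whether flush-to-zero is enabled, so no particular ratio should be expected.

```python
import timeit
import numpy as np

n = 10_000
normal = np.full(n, 1e-3, dtype=np.float32)      # well inside the normal range
subnormal = np.full(n, 1e-39, dtype=np.float32)  # below 2**-126, so subnormal

def time_square(x, number=200):
    """Best-of-5 average time, in seconds, of one elementwise x * x."""
    best = min(timeit.repeat(lambda: x * x, number=number, repeat=5))
    return best / number

t_normal = time_square(normal)
t_subnormal = time_square(subnormal)

print("normal:    %.2f us per loop" % (t_normal * 1e6))
print("subnormal: %.2f us per loop" % (t_subnormal * 1e6))
```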

solution1
4 ACCEPTED 2014-07-10 17:46:22

solution2
1 2014-07-09 21:16:50
