
Strange performance results for numpy matrix multiplication

Recently I've discovered a case in which matrix multiplication with numpy shows very strange performance (at least to me). To illustrate it, I've created an example of such matrices and a simple script to demonstrate the timings. Both can be downloaded from the repo; I don't include the script here because it's of little use without the data.

The script multiplies two pairs of matrices (each pair is the same in terms of shape and dtype; only the data differs) in different ways, using both the dot function and einsum. I've noticed several anomalies:

  • The first pair (A * B) is multiplied much faster than the second one (C * D).
  • When I convert all matrices to float64, the times become the same for both pairs: longer than it took to multiply A * B, but shorter than C * D.
  • These effects remain for both einsum (a pure numpy implementation, as I understand it) and dot (which uses BLAS on my machine).

For the sake of completeness, here is the output of this script on my laptop:
With np.dot:
A * B: 0.142910003662 s
C * D: 4.9057161808 s
A * D: 0.20524597168 s
C * B: 4.20220398903 s
A * B (to float32): 0.156805992126 s
C * D (to float32): 5.11792707443 s
A * B (to float64): 0.52608704567 s
C * D (to float64): 0.484733819962 s
A * B (to float64 to float32): 0.255760908127 s
C * D (to float64 to float32): 4.7677090168 s

With einsum:
A * B: 0.489732980728 s
C * D: 7.34477996826 s
A * D: 0.449800014496 s
C * B: 4.05954909325 s
A * B (to float32): 0.411967992783 s
C * D (to float32): 7.32073783875 s
A * B (to float64): 0.80580997467 s
C * D (to float64): 0.808521032333 s
A * B (to float64 to float32): 0.414498090744 s
C * D (to float64 to float32): 7.32472801208 s

How can such results be explained, and how can C * D be multiplied as fast as A * B?

The slowdown you're seeing is due to calculations involving subnormal numbers. Many processors are much slower when performing arithmetic operations with subnormal inputs or outputs. There are a couple of existing StackOverflow questions that are related: see this related C# question (in particular the answer from Eric Postpischil), and this answer to a C++ question for more information.
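As a minimal illustration (not from the original post), the penalty can be reproduced with two elementwise multiplications that differ only in magnitude. The exact slowdown varies by CPU and by whether flush-to-zero/denormals-are-zero is enabled, so no specific ratio is claimed here:

```python
import numpy as np
import timeit

N = 100_000
normal = np.full(N, 1e-3, dtype=np.float32)      # well inside the normal range
subnormal = np.full(N, 1e-40, dtype=np.float32)  # below 2**-126, stored as a subnormal
half = np.float32(0.5)

# Multiplying by 0.5 keeps subnormal inputs subnormal in the output,
# so both operands and results can hit the processor's slow path.
t_normal = timeit.timeit(lambda: normal * half, number=1000)
t_subnormal = timeit.timeit(lambda: subnormal * half, number=1000)
print(f"normal:    {t_normal:.4f} s")
print(f"subnormal: {t_subnormal:.4f} s")
```

On hardware that penalizes subnormals, the second timing is typically several times larger than the first.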

In your specific case, the matrix C (with a dtype of float32) contains several subnormal numbers. For single-precision floats, the subnormal/normal boundary is 2^-126, or around 1.18e-38. Here's what I see for C:

>>> ((0 < abs(C)) & (abs(C) < 2.0**-126)).sum()  # number of subnormal entries
44694
>>> C.size
682450

So around 6.5% of C's entries are subnormal, which is more than enough to slow down the C*B and C*D multiplications. In contrast, A and B don't go anywhere near the subnormal boundary:

>>> abs(A[A != 0]).min()
4.6801152e-12
>>> abs(B[B != 0]).min()
4.0640174e-07

So none of the intermediate values involved in the A*B matrix multiplication is subnormal, and no speed penalty applies.

As to the second part of your question, I'm not sure what to suggest. If you try hard enough, and you're using x64/SSE2 (rather than the x87 FPU), you can set the flush-to-zero and denormals-are-zero flags from Python. See this answer for a crude and non-portable ctypes-based hack; if you really want to follow this route, writing a custom C extension to do this might be a better bet.

I'd be tempted instead to try scaling C to bring it entirely into the normal range (and to bring the individual products from C*D into the normal range, too), but that might not be possible if C also has values at the upper extremes of the floating-point range. Alternatively, simply replacing the tiny values in C with zeros might work, but whether the resulting accuracy loss is significant and/or acceptable would depend on your application.
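A sketch of the zero-replacement option (the random C below is a hypothetical stand-in for the matrix from the repo, constructed so that its entries land in the subnormal range):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for C: a float32 matrix containing subnormal entries.
C = (rng.standard_normal((1000, 100)) * 1e-39).astype(np.float32)

tiny = np.finfo(np.float32).tiny  # 2**-126, the smallest normal float32
mask = np.abs(C) < tiny           # subnormal entries (and exact zeros)
print(f"flushing {np.count_nonzero(C[mask])} subnormal entries to zero")
C[mask] = 0.0
```

The scaling alternative would instead multiply C by a power of two (say 2.0**64) before the product and divide the result by the same factor afterwards; scaling by a power of two is exact in binary floating point, so it changes no bits other than the exponent.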

Mark Dickinson already answered your question, but just for fun, try this:

Cp = np.array(list(C[:,0]))
Ap = np.array(list(A[:,0]))

This removes the slicing overhead and makes sure that the arrays are laid out the same way in memory.

%timeit Cp * Cp   # 34.9 us per loop
%timeit Ap * Ap   # 3.59 us per loop

Whoops.
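For what it's worth, the effect of the list() round-trip on memory layout can be checked directly; the matrix below is a hypothetical stand-in for C from the post:

```python
import numpy as np

C = np.zeros((1000, 5), dtype=np.float32)  # hypothetical stand-in for C
col_view = C[:, 0]             # strided view: steps over a whole row per element
Cp = np.array(list(C[:, 0]))   # contiguous copy: elements tightly packed

print(col_view.strides)            # (20,) -- 5 columns * 4 bytes between elements
print(Cp.strides)                  # (4,)  -- one float32 per step
print(Cp.flags['C_CONTIGUOUS'])    # True
```

So both copies really are contiguous and identically laid out, which confirms that the remaining timing gap comes from the data itself (the subnormals in C), not from striding.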
