CPU time for matrix-matrix multiplication
I am trying to decide whether several similar but independent problems should be dealt with simultaneously or sequentially (possibly in parallel on different computers). In order to decide, I need to compare the CPU times of the following operations:
time_1 is the time for computing X (with shape (n, p)) @ b (with shape (p, 1)).
time_k is the time for computing X (with shape (n, p)) @ B (with shape (p, k)).
where X, b and B are random matrices. The difference between the two operations is the width of the second matrix.
Naively, we expect that time_k = k × time_1. With faster matrix multiplication algorithms (Strassen algorithm, Coppersmith–Winograd algorithm), time_k could be smaller than k × time_1, but the complexity of these algorithms remains much larger than what I observed in practice. Therefore my question is: how can the large difference in CPU time between these two computations be explained?
The code I used is the following:
import time
import numpy as np
import matplotlib.pyplot as plt

p = 100
# Widths k of the second matrix: dense for small k, coarser for large k.
width = np.concatenate([np.arange(1, 20), np.arange(20, 100, 10), np.arange(100, 4000, 100)]).astype(int)
mean_time = []
for nk, kk in enumerate(width):
    timings = []
    nb_tests = 10000 if kk <= 300 else 100
    for ni, ii in enumerate(range(nb_tests)):
        print('\r[', nk, '/', len(width), ', ', ni, '/', nb_tests, ']', end = '')
        x = np.random.randn(p).reshape((1, -1))
        coef = np.random.randn(p, kk)
        d = np.zeros((1, kk))
        # Time a single (1, p) @ (p, kk) product.
        start = time.time()
        d[:] = x @ coef
        end = time.time()
        timings.append(end - start)
    mean_time.append(np.mean(timings))
mean_time = np.array(mean_time)

# Compare the measured mean time with the naive k * time_1 prediction.
fig, ax = plt.subplots(figsize =(14,8))
plt.plot(width, mean_time, label = 'mean(time_k)')
plt.plot(width, width*mean_time[0], label = 'k*mean(time_1)')
plt.legend()
plt.xlabel('k')
plt.ylabel('time (sec)')
plt.show()
You aren't timing only the multiplication operation. time.time() itself takes time to complete.
>>> print(time.time() - time.time())
-9.53674316406e-07
Multiplied by the number of tries (10000) and then by the number of instances, this becomes significant overhead: for n=100 you are in fact comparing 1,000,000 calls to time.time() with 100 regular numpy array multiplications.
For quick benchmarking, Python provides a dedicated module that doesn't have this problem: see timeit.
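As a rough sketch of what a timeit-based measurement could look like (the helper name time_matmul, the widths and the repeat counts are my own choices, not taken from the question), timeit runs the product many times inside one timed loop, so the clock is read once per batch instead of once per multiplication:

```python
import time
import timeit

import numpy as np


def time_matmul(p, k, number=200, repeat=3):
    """Return the best per-call time (seconds) of a (1, p) @ (p, k) product.

    Hypothetical helper for illustration; not part of the original code.
    """
    x = np.random.randn(1, p)
    coef = np.random.randn(p, k)
    # The lambda adds a small, constant call overhead to every measurement.
    timer = timeit.Timer(lambda: x @ coef)
    # Each run times `number` calls in one go; keep the fastest run.
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Cost of one time.time() call, for comparison with a small product.
clock_cost = timeit.timeit(time.time, number=100_000) / 100_000
time_1 = time_matmul(100, 1)
time_k = time_matmul(100, 50)

print(f"time.time() call: {clock_cost:.2e} s")
print(f"time_1:           {time_1:.2e} s")
print(f"time_50:          {time_k:.2e} s  (50 * time_1 = {50 * time_1:.2e} s)")
```

Taking the minimum over several runs is a common way to reduce noise from other processes; the absolute numbers will depend on the machine and on the BLAS library numpy is linked against.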
The detailed reason is very complex. When the PC runs X @ b, it also executes many other required instructions, such as loading data from RAM to cache, and so on. In other words, the elapsed time contains two parts: the 'real calculation instructions' in the CPU, represented by Cost_A, and the 'other required instructions', represented by Cost_B. My idea, and it is only a guess, is that Cost_B is what leads to time_k << k × time_1.
When the shape of b is small (e.g. 1000 x 1), the 'other required instructions' take up relatively the most time. When the shape of b is huge (e.g. 1000 x 10000), their share is relatively small. The following group of experiments gives a less rigorous proof. We can see that when the shape of b increases from (1000 x 1) to (1000 x ), the cost time increases very slowly.
import numpy as np
import time

X = np.random.random((1000, 1000))
b = np.random.random((1000, 1))
b3 = np.random.random((1000, 3))
b5 = np.random.random((1000, 5))
b7 = np.random.random((1000, 7))
b9 = np.random.random((1000, 9))
b10 = np.random.random((1000, 10))
b30 = np.random.random((1000, 30))
b60 = np.random.random((1000, 60))
b100 = np.random.random((1000, 100))
b1000 = np.random.random((1000, 1000))

def test_cost(X, b):
    # Print the mean time of one X @ b, averaged over 100 runs.
    begin = time.time()
    for i in range(100):
        _ = X @ b
    end = time.time()
    print((end-begin)/100.)

test_cost(X, b)
test_cost(X, b3)
test_cost(X, b5)
test_cost(X, b7)
test_cost(X, b9)
test_cost(X, b10)
test_cost(X, b30)
test_cost(X, b60)
test_cost(X, b100)
test_cost(X, b1000)
output:
0.0003210139274597168
0.00040063619613647463
0.0002452659606933594
0.00026523590087890625
0.0002449488639831543
0.00024344682693481446
0.00040068864822387693
0.000691361427307129
0.0011700797080993653
0.009680757522583008
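To make the Cost_A / Cost_B idea concrete, here is a minimal sketch that fits a straight line to the timings above. It assumes the deliberately simplified model t(k) ≈ cost_b + k × cost_a, which is my own illustration and not something the measurements prove; the names cost_a and cost_b are hypothetical.

```python
import numpy as np

# Widths k and mean times (seconds) copied from the output above.
k = np.array([1, 3, 5, 7, 9, 10, 30, 60, 100, 1000], dtype=float)
t = np.array([
    0.0003210139274597168,
    0.00040063619613647463,
    0.0002452659606933594,
    0.00026523590087890625,
    0.0002449488639831543,
    0.00024344682693481446,
    0.00040068864822387693,
    0.000691361427307129,
    0.0011700797080993653,
    0.009680757522583008,
])

# Assumed model (a simplification): t(k) ~ cost_b + k * cost_a.
# np.polyfit returns the coefficients from highest degree to lowest.
cost_a, cost_b = np.polyfit(k, t, 1)

# Naive prediction from the question: time_k = k * time_1.
naive = k * t[0]

print(f"fitted per-column cost (cost_a): {cost_a:.3e} s")
print(f"fitted fixed cost (cost_b):      {cost_b:.3e} s")
for ki, ti, ni in zip(k, t, naive):
    print(f"k={int(ki):5d}  measured={ti:.3e}  naive={ni:.3e}")
```

In the listed output, the time for k = 1000 is only about 30 times the time for k = 1, not 1000 times, which is what a large fixed part relative to the per-column part would look like.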
In addition, I ran a set of experiments with perf on Linux. Under perf, Cost_B may be even larger. I have 8 Python files; the first one is as follows.
import numpy as np
import time

def broken2():
    mtx = np.random.random((1, 1000))
    c = None
    c = mtx ** 2

broken2()
I processed the output into table A, as follows.
I did a simple analysis: I divided the difference in each operation count (e.g. cache-misses) between neighboring experiments by the difference in time elapsed (seconds). This gives the following table B. From the table, we can see that as the shape of b increases, the linear relation between shape and cost time becomes more obvious. And maybe the main reason for time_k << k × time_1 is cache misses (loading data from RAM to cache), because that quantity stabilizes first.
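One back-of-the-envelope way to see why memory traffic could matter more for narrow b is to compare the arithmetic work with the amount of data touched. The sketch below is my own illustration, not a measurement: the function name matmul_intensity is hypothetical, and it assumes each array is moved between RAM and cache exactly once, which real BLAS implementations only approximate.

```python
def matmul_intensity(n, p, k):
    """Approximate flops per array element touched for (n, p) @ (p, k)."""
    # About one multiply and one add per term of each dot product.
    flops = 2 * n * p * k
    # Read X (n*p), read the second matrix (p*k), write the result (n*k).
    elements = n * p + p * k + n * k
    return flops / elements


for k in (1, 10, 100, 1000):
    ratio = matmul_intensity(1000, 1000, k)
    print(f"k={k:5d}  flops per element touched: {ratio:.1f}")
```

Under these assumptions, the k = 1 case does roughly two floating-point operations per element loaded, so loading X dominates, while for k = 1000 each loaded element is reused hundreds of times. That is consistent with the guess that cache misses explain why the time grows so slowly for small k.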