
Time multiplying matrices and vectors

I wrote an algorithm that computes a matrix-vector product in two ways. Imagine I have an n×n matrix A and an n×1 vector X:

1) inner product: Y11 = A11*X11 + A12*X21 + .... + A1n*Xn1, and so on

2) linear combination: Y = Column1*X1 + Column2*X2 + ..., and so on
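To make the two orderings concrete, here is a small sketch (with a tiny hypothetical n) checking that both loop orders agree with numpy's `A @ x`:

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
A = rng.random((n, n))
x = rng.random((n, 1))

# 1) inner product: one full row of A per output entry (row-wise access)
y_inner = np.zeros(n)
for i in range(n):
    for j in range(n):
        y_inner[i] += A[i][j] * x[j][0]

# 2) linear combination: one full column of A per input entry (column-wise access)
y_comb = np.zeros(n)
for j in range(n):
    for i in range(n):
        y_comb[i] += A[i][j] * x[j][0]

assert np.allclose(y_inner, (A @ x).ravel())
assert np.allclose(y_comb, (A @ x).ravel())
```

Both produce the same result; the only difference is the order in which the elements of A are visited.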

I did this because I wanted to compare which is the fastest way to multiply matrix by vector. I tried with

n = [10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1500, 1000, 800, 500, 300, 100]

I ran the algorithm 3 times, took the average for every value of n, and plotted the results. I was right: the inner product is the fastest, because of how the matrix is stored in memory. That's fine, but when I ran the algorithm 3 times to plot the points, I got time(n=7000) > time(n=8000) > time(n=9000). I want to know why this is. I thought it could be that the computer starts calculating slowly, but speeds up if I run the algorithm more times. However, I ran the algorithm 7 more times after the first 3 (10 in total) and got about the same result (this time n=7000 took about 130 sec rather than 150, but it was still the case that time(7000) > time(8000) > time(9000)).

This is the code I wrote:

import time
import numpy as np

def Lineal_C(A, x, n):
    y = np.zeros(n)
    t = time.perf_counter()  # time.clock() was removed in Python 3.8
    for j in range(n):       # column-wise: walks down each column of A
        for i in range(n):
            y[i] = y[i] + A[i][j] * x[j][0]
    time_spent = time.perf_counter() - t
    print("%.10f sec" % time_spent + " n=" + str(n) + " Lineal Combination")

def Inner_P(A, x, n):
    y = np.zeros(n)
    t = time.perf_counter()
    for i in range(n):       # row-wise: walks across each row of A
        for j in range(n):
            y[i] = y[i] + A[i][j] * x[j][0]
    time_spent = time.perf_counter() - t
    print("%.10f sec" % time_spent + " n=" + str(n) + " ------MatVectFila - Producto Interno--------")

At first I thought the answer might involve some important detail about the actual size of the data and size-dependent execution paths, but at the end of the day the loops are in Python. To the best of my understanding, the standard Python interpreter doesn't do anything fancy (e.g., no JIT).

The important part of the linked gist is that all the calculations are performed on separate, non-blocking threads. Even though the sizes are specified from largest to smallest, the printouts are reported from smallest to largest. This is because the code starts all the threads in parallel (well, in rapid succession) and the smallest ones finish first (despite starting slightly later). This means that eventually only the large-data threads will be running, and they will be relatively faster toward the end.
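This behavior is easy to reproduce with a minimal sketch (the workload sizes here are hypothetical stand-ins): threads are started in rapid succession from largest to smallest, yet the smallest pure-Python workloads tend to complete first because all of the threads share the interpreter (the GIL) from the start.

```python
import threading

finish_order = []
lock = threading.Lock()

def work(n):
    # A GIL-bound pure-Python loop standing in for the matrix arithmetic.
    s = 0
    for i in range(n):
        s += i
    with lock:
        finish_order.append(n)

sizes = [1_500_000, 500_000, 100_000]  # hypothetical, largest to smallest
threads = [threading.Thread(target=work, args=(n,)) for n in sizes]
for t in threads:
    t.start()   # started in rapid succession, largest first
for t in threads:
    t.join()

print(finish_order)  # usually smallest to largest, despite the start order
```

The threads interleave under the GIL, so progress is shared roughly round-robin and the smallest job crosses the finish line first.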

If you run each data size sequentially, you never see these strange timing effects, because each size is no longer "fighting for cycles" with the other sizes. Note that this may be fighting for cycles in the processor, or fighting for interpreter time under the global interpreter lock.

The code below forces each size to complete before running the next size:

    for value in n_values:
       t1 = threading.Thread(target=threadss , args=(value,))
       t1.start()
       t1.join() #forces t1 to finish before the next loop iteration

At this point I thought I had figured out the problem: larger sizes were executing later, and thus could run relatively faster since they didn't have to share processing time with as many threads. However, if they all start at the same time, running faster for just a bit shouldn't make something take less time overall, just less time per unit of work (which is not what was measured). For example, if you start n=7000 and n=8000 nearly simultaneously, the n=8000 run should compute slightly faster once n=7000 has finished, but it shouldn't finish faster than n=7000. So something was missing.

Interestingly, if you add a print statement to the beginning of the loops, you should notice that the loop threads execute nearly sequentially from small to large sizes. At first I thought this was something strange with the threading process, but it turns out the real culprit is this line:

matrix, vector = ranMatrix(value)

That function call builds up a list of lists of numbers, something that is hugely inefficient in Python compared to something like numpy, so it takes a non-negligible amount of time (likely exacerbated by Python's global interpreter lock). This call is outside of the timer and has the effect of delaying the start of the larger data loops until after the smaller ones have finished (again, visible if you print at the start of the loops). Thus, my assumption that all the loops start at the same time is incorrect; in reality, the large-data loops don't start until after the smaller-data loops because of this untimed initialization. I therefore think my conclusion above is correct; I just hadn't realized the importance of the delayed start.
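`ranMatrix` itself is not shown in the question; a plausible pure-Python version, alongside a vectorized numpy equivalent, might look like this (both function names here are hypothetical):

```python
import random
import numpy as np

def ranMatrix_listy(n):
    # Pure-Python list of lists: n*n + n separate calls into random,
    # built element by element -- slow for n in the thousands, and it
    # runs *outside* the timed region, delaying the thread's real start.
    matrix = [[random.random() for _ in range(n)] for _ in range(n)]
    vector = [[random.random()] for _ in range(n)]
    return matrix, vector

def ranMatrix_numpy(n):
    # Vectorized equivalent: allocation and fill happen in C.
    rng = np.random.default_rng()
    return rng.random((n, n)), rng.random((n, 1))

m, v = ranMatrix_listy(100)
assert len(m) == 100 and len(m[0]) == 100 and len(v) == 100
```

For n = 10000 the list-of-lists version performs a hundred million interpreter-level operations before the timer ever starts, which is why the larger threads appear to "start late".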

So, to reiterate: when 7000 is running, 8000, 9000, and 10000 are still building their arrays. When 8000 is running, 7000 is done (as are the lower sizes), and only 9000 and 10000 are still building their arrays. At some point, the faster execution per element outweighs the larger array/matrix size, and overall execution time drops. Put another way, if total time is n_elements * time_per_element, and time_per_element gets small enough, then total time will decrease even though n_elements has increased.
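A quick back-of-the-envelope check with purely hypothetical per-element costs shows how the tipping point works:

```python
# Hypothetical per-element costs, not measured values: the n=7000 run
# shares the CPU/GIL with several other threads for its whole lifetime,
# while the n=8000 run starts late and executes mostly alone.
t_contended = 3.0e-6  # hypothetical sec/element under contention
t_alone     = 1.8e-6  # hypothetical sec/element running mostly alone

total_7000 = 7000**2 * t_contended  # 49e6 elements, contended
total_8000 = 8000**2 * t_alone      # 64e6 elements, mostly alone

print(total_7000, total_8000)  # 147.0 vs 115.2 seconds
assert total_8000 < total_7000
```

More elements, yet less wall time, exactly the time(7000) > time(8000) inversion seen in the measurements.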

In addition to blocking on the threads, you can also initialize all the data up front. The two changes are:

    matrix, vector = ranMatrix(10000)
    for value in n_values:
       t1 = threading.Thread(target=threadss , args=(value,matrix,vector))
       t1.start()

and

def threadss(value,matrix,vector):

    threading.Thread(target=linealComb , args=(matrix , vector, value,)).start()
    threading.Thread(target=innerProduct , args=(matrix , vector, value,)).start()

Since the code runs over a fixed size, it doesn't matter that the matrix and vector are actually larger than needed. This change also ensures that larger data sizes never run faster.
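Putting it all together, a thread-free sequential benchmark avoids every one of these interactions; the sketch below (using `time.perf_counter`, since `time.clock` was removed in Python 3.8, and with smaller sizes so it runs quickly) builds the data before starting the clock:

```python
import time
import numpy as np

def inner_product(A, x, n):
    # Row-wise traversal, as in the question's Inner_P.
    y = np.zeros(n)
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j][0]
    return y

rng = np.random.default_rng(0)
for n in [100, 300, 500]:
    A = rng.random((n, n))
    x = rng.random((n, 1))
    t0 = time.perf_counter()   # data is built *before* the clock starts
    y = inner_product(A, x, n)
    elapsed = time.perf_counter() - t0
    print(f"{elapsed:.6f} sec  n={n}  inner product")
```

Run sequentially like this, each size gets the whole interpreter to itself, so the timings grow monotonically with n.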
