
Dense matrix multiplication optimization has small effect in python

I would like to know why the loop interchange that moves k to the outermost position (which should give better data locality) does not work with the following code in Python:

import random
import time

# Flat, row-major list representing a size x size matrix
def fillRand1D(size):
    return [random.uniform(-2.0, 2.0) for _ in range(size * size)]

# Naive i-j-k loop order
def mmNaive1D(A, B, size):
    C = [0] * size * size
    for i in range(size):
        for j in range(size):
            for k in range(size):
                C[i * size + j] += A[i * size + k] * B[k * size + j]
    return C

# Same i-j-k order, but accumulates into a local variable so that
# C[i * size + j] is not re-indexed on every k iteration
def mmInvariant1D(A, B, size):
    C = [0] * size * size
    for i in range(size):
        for j in range(size):
            sigma = C[i * size + j]
            for k in range(size):
                sigma += A[i * size + k] * B[k * size + j]
            C[i * size + j] = sigma
    return C


# k-i-j loop order: the inner loop now walks B (and C) row by row
def mmLoop1D(A, B, size):
    C = [0] * size * size
    for k in range(size):
        for i in range(size):
            for j in range(size):
                C[i * size + j] += A[i * size + k] * B[k * size + j]
    return C


# k-i-j order with A[i * size + k] hoisted out of the inner loop
def mmLoopInvariant1D(A, B, size):
    C = [0] * size * size
    for k in range(size):
        for i in range(size):
            Aik = A[i * size + k]
            for j in range(size):
                C[i * size + j] += Aik * B[k * size + j]
    return C

def main():
    matmul_func_1D = [mmNaive1D, mmInvariant1D, mmLoop1D, mmLoopInvariant1D]
    size = 200
    A_1D = fillRand1D(size)
    B_1D = fillRand1D(size)

    for f in matmul_func_1D:
        A = A_1D[:]  # copy, so every function receives identical inputs
        B = B_1D[:]
        start_time = time.time()
        C = f(A, B, size)
        print(f.__name__ + " in " + str(time.time() - start_time) + " s")

if __name__ == '__main__':
    main()

The results with Python are:

mmNaive1D in 3.420367956161499 s  
mmInvariant1D in 2.316128730773926 s  
mmLoop1D in 3.4071271419525146 s  
mmLoopInvariant1D in 2.5221548080444336 s

Whereas the same optimizations written in C++ give:

> Time [MM naive] 1.780587 s
> Time [MM invariant] 1.642554 s
> Time [MM loop IKJ] 0.304621 s
> Time [MM loop IKJ invariant] 0.276159 s

Your A, B and C matrices are Python lists, which are quite slow for big dimensions. You could create them as numpy arrays to try to speed up the functions:

>>> import numpy as np
>>> A = np.random.uniform(-2., 2., (size,size))
>>> B = np.random.uniform(-2., 2., (size,size))

This defines A and B as size x size random matrices.

The bottleneck of your code is the Python loops. Python loops are quite slow, and you want to avoid them in algebraic operations whenever possible. In your case, what you are trying to achieve (if I'm not wrong) is C = A * B, where * is the dot product between two 2D matrices. Translated to numpy:

>>> C = A.dot(B)

If you time it:

>>> %timeit C = A.dot(B)
100 loops, best of 3: 7.91 ms per loop

It takes only 7.91 ms to compute the dot product. I know that you wanted to play with pure Python, but if you are going to do mathematical computation, the sooner you move to numpy the better for your algorithms' speed.
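As a sanity check that the pure-Python loops and numpy agree, here is a minimal sketch (my addition; it assumes the mmNaive1D function from the question is defined in the same session):

import numpy as np

size = 200
A = np.random.uniform(-2., 2., (size, size))
B = np.random.uniform(-2., 2., (size, size))

# Flatten to plain Python lists so the question's 1D functions can use them
A_1D = A.ravel().tolist()
B_1D = B.ravel().tolist()

# mmNaive1D is the function from the question
C_loops = np.array(mmNaive1D(A_1D, B_1D, size)).reshape(size, size)
C_numpy = A.dot(B)

print(np.allclose(C_loops, C_numpy))  # expected: True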

Going back to your algorithm: if I run your code on my computer, the optimization does work and achieves a bit of speedup:

mmNaive1D in 1.72708702087 s
mmInvariant1D in 1.64227509499 s
mmLoop1D in 1.57529997826 s
mmLoopInvariant1D in 1.26218104362 s

So it is working for me.

EDIT: After running the code several times I'm getting different results:

mmNaive1D in 1.63492894173 s
mmInvariant1D in 1.1577808857 s
mmLoop1D in 1.67409181595 s
mmLoopInvariant1D in 1.32283711433 s

This time the "optimization" is not working. I guess that is because the bottleneck of the algorithm is the Python loops themselves, not the memory accesses.
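One way to convince yourself of that: the little sketch below (my own experiment, not from the question) traverses the same flat list row by row and column by column. In C the strided traversal would be much slower because of cache misses; in CPython the two usually time about the same, because the list only stores pointers to boxed objects and the interpreter overhead dominates:

import timeit

size = 200
data = list(range(size * size))

def row_major():
    # consecutive indices: good spatial locality in a C array
    s = 0
    for i in range(size):
        for j in range(size):
            s += data[i * size + j]
    return s

def col_major():
    # stride-size indices: poor spatial locality in a C array
    s = 0
    for j in range(size):
        for i in range(size):
            s += data[i * size + j]
    return s

print("row-major:", min(timeit.repeat(row_major, number=100, repeat=5)))
print("col-major:", min(timeit.repeat(col_major, number=100, repeat=5)))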

EDIT2: Some more relevant results, running 100 iterations of each algorithm to get mean timings (it took a while, haha):

mmNaive1D in 1.66692941904 s
mmInvariant1D in 1.15141540051 s
mmLoop1D in 1.58852998018 s
mmLoopInvariant1D in 1.28386260986 s

Even if 100 iterations are still not significant enough (1,000,000 would be better, but that would take ages), they suggest that, averaged over 100 runs, the "optimization" is not really an optimization. I don't really know why, but it could be that introducing the extra local variable in Python costs more than the 200 list accesses it saves (maybe there is some caching behind them).
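For what it's worth, timeit gives less noisy numbers than a single time.time() measurement. A sketch of such a harness (bench is my own naming; it assumes the functions and the A_1D, B_1D, size data from the question are in scope):

import timeit

def bench(func, A, B, size, repeat=5):
    # one full matrix multiplication per measurement round
    times = timeit.repeat(lambda: func(A, B, size), repeat=repeat, number=1)
    print("%s: mean %.3f s, best %.3f s"
          % (func.__name__, sum(times) / len(times), min(times)))

for f in [mmNaive1D, mmInvariant1D, mmLoop1D, mmLoopInvariant1D]:
    bench(f, A_1D, B_1D, size)

Taking the best of several rounds usually filters out scheduling noise better than a plain mean does.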
