
Why is my prof's version of LU decomposition faster than mine? Python numpy

I am attending a course in Numerical Analysis at my university. We are studying LU decomposition. I tried implementing my version before looking at my lecturer's. I thought mine was pretty fast, but actually comparing them, my lecturer's version is much faster even though it uses loops! Why is that?

Lecturer Version

import numpy as np

def LU_decomposition(A):
    """Perform LU decomposition using the Doolittle factorisation."""

    L = np.zeros_like(A)
    U = np.zeros_like(A)
    N = np.size(A, 0)

    for k in range(N):
        L[k, k] = 1
        U[k, k] = (A[k, k] - np.dot(L[k, :k], U[:k, k])) / L[k, k]
        for j in range(k+1, N):
            U[k, j] = (A[k, j] - np.dot(L[k, :k], U[:k, j])) / L[k, k]
        for i in range(k+1, N):
            L[i, k] = (A[i, k] - np.dot(L[i, :k], U[:k, k])) / U[k, k]

    return L, U

My Version

def lu(A, non_zero = 1):
    '''
    Given a matrix A, factorizes it into two matrices L and U, where L is
    lower triangular and U is upper triangular. This method implements
    Doolittle's method which sets l_ii = 1, i.e. L is a unit triangular
    matrix.

    :param      A: Matrix to be factorized. NxN
    :type       A: numpy.array

    :param non_zero: Value to which l_ii is assigned to. Must be non_zero.
    :type  non_zero: non-zero float.

    :return: (L, U)
    '''
    # Check if the matrix is square
    if A.shape[0] != A.shape[1]:
        return 'Input argument is not a square matrix.'

    # Store the size of the matrix
    n = A.shape[0]

    # Instantiate two zero matrices NxN (L, U)
    L = np.zeros((n,n), dtype = float)
    U = np.zeros((n,n), dtype = float)

    # Start algorithm
    for k in range(n):
        # Specify non-zero value for l_kk (Doolittle's)
        L[k, k] = non_zero
        # Case k = 0 is trivial
        if k == 0:
            # Complete first row of U
            U[0, :] = A[0, :] / L[0, 0]
            # Complete first column of L
            L[:, 0] = A[:, 0] / U[0, 0]
        # Case k = n-1 is trivial
        elif k == n-1:
            # Obtain  u_nn
            U[-1, -1] = (A[-1, -1] - np.dot(L[-1, :], U[:, -1])) / L[-1, -1]

        else:
            # Obtain u_kk
            U[k, k] = (A[k, k] - np.dot(L[k, :], U[:, k])) / L[k, k]
            # Complete kth row of U
            U[k, k+1:] = (A[k, k+1:] - [np.dot(L[k, :], U[:, i]) for i in \
                         range(k+1, n)]) / L[k, k]
            # Complete kth column of L
            L[k+1:, k] = (A[k+1:, k] - [np.dot(L[i, :], U[:, k]) for i in \
                         range(k+1, n)]) / U[k, k]
    return L, U

Benchmarking

I used the following commands:

A = np.random.randint(1, 10, size = (4,4))
%timeit lu(A)
57.5 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit LU_decomposition(A)
42.1 µs ± 776 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And also, how come scipy's version is so much better?

%timeit scipy.linalg.lu(A)
6.47 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
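
For reference, scipy.linalg.lu also performs partial pivoting and returns a permutation matrix P, so it does a bit more work than either function above. A rough sanity check of the three (just a sketch, assuming the two functions above are already defined and using a float test matrix, since np.zeros_like on the integer matrix from randint would give integer L and U in the lecturer's version):

import numpy as np
import scipy.linalg

A = np.random.rand(4, 4)             # float matrix instead of randint integers

P, L, U = scipy.linalg.lu(A)         # scipy returns a permutation matrix as well
print(np.allclose(P @ L @ U, A))     # True

L1, U1 = LU_decomposition(A)         # lecturer's version (no pivoting)
L2, U2 = lu(A)                       # my version (no pivoting)
print(np.allclose(L1 @ U1, A), np.allclose(L2 @ U2, A))   # True, True (if no zero pivot is hit)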

Your code has conditionals in the Python code, where the lecturer's version does not. The numpy library is highly optimized in native code, so anything you can do to push the computation into numpy as opposed to Python will help make it faster.

Scipy must have an even more optimized version of this in its library; seeing as it's a single call to do this operation, the outer loop is likely part of the optimized native code instead of the relatively slow Python code.
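
To make that concrete (an aside, not from the original answer): scipy's pivoted LU is a wrapper around LAPACK's *getrf routines, which you can also reach through scipy.linalg.lu_factor, where the entire triple loop runs in compiled code:

import numpy as np
from scipy.linalg import lu_factor

A = np.random.rand(4, 4)
lu_packed, piv = lu_factor(A)   # L and U packed into one array, plus pivot indices;
                                # the whole factorisation happens inside LAPACK, not Python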

You might try benchmarking using Cython and see what difference a more optimized Python runtime makes.
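
A minimal sketch of that suggestion in a Jupyter notebook (assuming the cython package is installed; LU_decomposition_cy is just a hypothetical name for the compiled copy). Run %load_ext Cython in one cell, then put the rest in a cell that starts with %%cython:

%load_ext Cython

%%cython
import numpy as np

def LU_decomposition_cy(A):
    """Same Doolittle loops as the lecturer's version, compiled by Cython."""
    L = np.zeros_like(A)
    U = np.zeros_like(A)
    N = np.size(A, 0)
    for k in range(N):
        L[k, k] = 1
        U[k, k] = (A[k, k] - np.dot(L[k, :k], U[:k, k])) / L[k, k]
        for j in range(k + 1, N):
            U[k, j] = (A[k, j] - np.dot(L[k, :k], U[:k, j])) / L[k, k]
        for i in range(k + 1, N):
            L[i, k] = (A[i, k] - np.dot(L[i, :k], U[:k, k])) / U[k, k]
    return L, U

Then time it in a normal cell with %timeit LU_decomposition_cy(A). Without static type declarations the speedup is usually modest, since most of the time is already spent inside the numpy calls.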

I think yours is slower because of the intermediary data structures that you use:

  • a Python list is created with [np.dot(L[k, :], U[:, i]) for i in range(k+1, n)]
  • a numpy array is created with A[k, k+1:] - temp_list
  • another temporary numpy array is created with temp_ndarray / L[k, k]
  • finally, this temporary array is copied into the result array

For each of these steps the CPU has to execute a loop, even if you didn't write one explicitly. Numpy abstracts these loops away, but they still have to be executed! Of course it can often pay off to have several implicit fast loops in numpy instead of one Python loop, but this only holds for medium-sized arrays. Also, a list comprehension really is only marginally faster than a regular for-loop.
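
To make this concrete, the temporary list and the intermediate arrays can be removed by writing the k-th row and column updates as single slice products. This is only a sketch of the idea (with a float conversion added), not anyone's original code:

import numpy as np

def lu_vectorised(A):
    """Doolittle LU without pivoting; each row/column update is one slice operation."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)                 # l_ii = 1, the rest is filled in below
    U = np.zeros((n, n))
    for k in range(n):
        # k-th row of U: one matrix-vector product replaces the per-column np.dot list
        U[k, k:] = A[k, k:] - L[k, :k] @ U[:k, k:]
        # k-th column of L: likewise one product and one division, no Python list
        L[k+1:, k] = (A[k+1:, k] - L[k+1:, :k] @ U[:k, k]) / U[k, k]
    return L, U

On a 4x4 matrix the fixed overhead of each numpy call still dominates, so don't expect this to reach scipy's timings; the difference should show up mainly on larger matrices.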

The scipy one is faster because it's a highly optimized implementation in a low-level programming language (whereas Python is a very high-level language). In the end, what this probably means is that you should appreciate your prof's code for its elegance and readability, not for its speed :)
