Why is a for loop over a native Python list faster than a for loop over a numpy array?

Question

I was reading the chapter that introduces numpy in High Performance Python and experimenting with the code on my own computer. By accident I ran the numpy version with a for loop and found it surprisingly slow compared to the native Python loop.

A simplified version of the code is below. I define a 2D array X filled with 0 and another 2D array Y filled with 1, then repeatedly add Y to X, conceptually X += Y.

import time
import numpy as np

grid_shape = (1024, 1024)

def simple_loop_comparison():
    xmax, ymax = grid_shape

    py_grid = [[0]*ymax for x in range(xmax)]
    py_ones = [[1]*ymax for x in range(xmax)]

    np_grid = np.zeros(grid_shape)
    np_ones = np.ones(grid_shape)

    def add_with_loop(grid, add_grid, xmax, ymax):
        for x in range(xmax):
            for y in range(ymax):
                grid[x][y] += add_grid[x][y]

    repeat = 20
    start = time.time()
    for i in range(repeat):
        # native python: loop over 2D array
        add_with_loop(py_grid, py_ones, xmax, ymax)
    print('for loop with native list=', time.time()-start)

    start = time.time()
    for i in range(repeat):
        # numpy: loop over 2D array
        add_with_loop(np_grid, np_ones, xmax, ymax)
    print('for loop with numpy array=', time.time()-start)

    start = time.time()
    for i in range(repeat):
        # vectorized numpy operation
        np_grid += np_ones
    print('numpy vectorization=', time.time()-start)

if __name__ == "__main__":
    simple_loop_comparison()

The result looks like:

# when repeat=10
for loop with native list= 2.545672655105591
for loop with numpy array= 11.622980833053589
numpy vectorization= 0.020279645919799805

# when repeat=20
for loop with native list= 5.195128440856934
for loop with numpy array= 23.241904258728027
numpy vectorization= 0.04613637924194336

I fully expected the numpy vectorized operation to outperform the other two, but I was surprised to see that a for loop over a numpy array is significantly slower than one over a native Python list. My understanding was that a numpy array should at least make good use of the cache, so even with a for loop and no vectorization it should outperform a list.

Is there something about numpy, or about how the CPU, cache, and memory work at a low level, that I don't understand? Thank you very much.

EDIT: changed title

Answer 1

An even simpler case, a list comprehension over a list versus over an array:

In [119]: x = list(range(1000000))
In [120]: timeit [i for i in x]
47.4 ms ± 634 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [121]: arr = np.array(x)
In [122]: timeit [i for i in arr]
131 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

A list has a data buffer that contains pointers to objects elsewhere in memory. So iterating over or indexing a list just requires looking up that pointer and fetching the object:

In [123]: type(x[1000])
Out[123]: int

An array stores its elements as bytes in a data buffer. Fetching an element requires finding those bytes (fast) and then wrapping them in a numpy object (according to the dtype). Such an object is similar to a 0d single-element array (with many of the same attributes).

In [124]: type(arr[1000])
Out[124]: numpy.int32

This indexing doesn't just fetch the number, it recreates it.
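
A minimal sketch of that difference, assuming a small example list and array (the names `x_small` and `arr_small` are illustrative and not part of the session above): indexing the list twice hands back the same stored object, while indexing the array twice builds two separate numpy scalar objects.

```python
import numpy as np

x_small = list(range(2000))      # hypothetical sample data
arr_small = np.array(x_small)

# A list stores references, so indexing returns the object already stored.
list_same = x_small[1000] is x_small[1000]

# An array stores raw bytes, so each index operation wraps them in a new scalar.
a = arr_small[1000]
b = arr_small[1000]
array_same = a is b

print(list_same)    # True
print(array_same)   # False
print(a == b)       # True: equal values, distinct objects
```

That per-access object creation is the extra work paid on every step of a Python-level loop over an array.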

I often describe an object dtype array as an enhanced or degraded list. Like a list, it contains pointers to objects elsewhere in memory, but it can't grow by append. We often say it loses many of the benefits of a numeric array. But its iteration speed falls between the other two:

In [125]: arrO = np.array(x, dtype=object)
In [127]: type(arrO[1000])
Out[127]: int
In [128]: timeit [i for i in arrO]
74.5 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Anyway, I've found in other SO answers that if you must iterate, stick with lists. And if you start with lists, it's often faster to stay with lists. As you note, numpy vectorized operations are fast, but it takes time to create an array, which may cancel out any time savings.

Compare the time it takes to create an array from this list with the time required to create such an array from scratch (with compiled numpy code):
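
If an array really has to be walked element by element in Python, one option is to convert it once with `tolist()`, which returns plain Python objects, and loop over that instead. A sketch, assuming a small array named `arr_small` (hypothetical) and that the one-time conversion cost is acceptable:

```python
import numpy as np

arr_small = np.arange(10)        # hypothetical sample data

# tolist() converts the whole buffer to Python ints in one compiled pass.
as_list = arr_small.tolist()

total = 0
for value in as_list:            # iterates plain Python ints, no numpy scalars
    total += value

print(type(as_list[0]))          # <class 'int'>
print(total)                     # 45
```

Whether this pays off depends on the array size and on how much work each iteration does, so it is worth timing on your own data.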

In [129]: timeit np.array(x)
109 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [130]: timeit np.arange(len(x))
1.77 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answer 2

Because there are conversions involved in asking numpy for data pointers, retrieving the values at those pointer locations, and then using them to iterate. A Python list has fewer of those steps. Numpy's speed gains only show up when it can iterate internally or perform vector or matrix math and then return an answer or a pointer to an array of answers.
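
As a rough illustration of that point, the sketch below runs the same addition two ways on a small grid: once with a Python-level double loop and once as a single numpy call. The 64x64 shape and the variable names are assumptions chosen so the example runs quickly; the timings will differ from machine to machine.

```python
import timeit
import numpy as np

shape = (64, 64)                 # kept small so the sketch runs quickly

def add_with_loop(grid, add_grid):
    # Each grid[x][y] access goes through Python-level indexing.
    for x in range(grid.shape[0]):
        for y in range(grid.shape[1]):
            grid[x][y] += add_grid[x][y]

looped = np.zeros(shape)
vectorized = np.zeros(shape)
ones = np.ones(shape)

add_with_loop(looped, ones)
vectorized += ones               # single call, iteration happens in compiled code

same = np.array_equal(looped, vectorized)
print(same)                      # True

loop_time = timeit.timeit(lambda: add_with_loop(looped, ones), number=5)
vec_time = timeit.timeit(lambda: np.add(vectorized, ones), number=5)
print(loop_time, vec_time)       # timings vary by machine
```

Both versions compute the same result; only the place where the iteration happens changes.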

Answer 1: 4, accepted, 2017-12-15 06:45:31

Answer 2: 0, 2017-12-15 04:10:43
