Why does a nested loop execute much faster than a flattened one?

Question

UPDATE

Sorry guys, but the previous number was INACCURATE. When I tested the previous code I used tqdm to see the expected time, and that function hurts performance when the iterable is very long. That is how I got 18.22s, which is about 7.5 times longer than 2.43s.

HOWEVER, the conclusion is CONSISTENT: the nested loop is much FASTER. When the number of iterations is 100^5, the difference is significant: 321.49 vs 210.05, a ratio of about 1.53. Generally we don't face iterations this long; I'm just curious to know the reason for this anomalous behaviour.

My Python version is 3.7.3. I use a 13-inch MacBook Pro 2019 with a 2.4 GHz Intel Core i5. The OS is macOS Mojave 10.14.6.

I found a weird situation in Python, as follows:

import time

start = time.time()
# flattened loop
for i in range(100**4):
    pass
print(time.time() - start) # 18.22(Wrong! Should be 3.09)

# nested loop
start = time.time()
for i in range(100):
    for j in range(100):
        for k in range(100):
            for l in range(100):
                pass
print(time.time() - start) # 2.43

The two kinds of loops above have the same number of iterations. However, the nested loop (2.43s) runs much faster than the flattened one (18.22s). The difference grows as the number of iterations increases. How does this happen?

Answer 1

Firstly, that is not a reliable way of measuring code execution time. Let us consider this code instead (to be run in IPython), which does not include the power calculation in the timed loop, and which does some computation just to make sure that it cannot be "optimized" to "please do nothing".

def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x


def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x


print(f'{"n":>4s}  {"n ** 4":>16s}  {"flattened":>18s}  {"nested4":>18s}')
for n in range(10, 120, 10):
    m = n ** 4  # the flattened loop must run n ** 4 iterations to match the nested one
    t1 = %timeit -q -o flattened_loop(m)
    t2 = %timeit -q -o nested4_loop(n)
    print(f'{n:4}  {n**4:16}  {t1.best * 1e3:15.3f} ms  {t2.best * 1e3:15.3f} ms')

   n            n ** 4           flattened             nested4
  10             10000            0.526 ms            0.653 ms
  20            160000            8.561 ms            8.459 ms
  30            810000           43.077 ms           39.417 ms
  40           2560000          136.709 ms          121.422 ms
  50           6250000          331.748 ms          291.132 ms
  60          12960000          698.014 ms          599.228 ms
  70          24010000         1280.681 ms         1081.062 ms
  80          40960000         2187.500 ms         1826.629 ms
  90          65610000         3500.463 ms         2942.909 ms
 100         100000000         5349.721 ms         4437.965 ms
 110         146410000         7835.733 ms         6474.588 ms

which shows that a difference does exist and that it grows with n.
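For anyone without IPython at hand, the same comparison can be sketched with the standard timeit module. This is only a rough equivalent of the benchmark above: best_time is a helper name made up for this sketch, and the repeat count and the small values of n are arbitrary choices to keep the run short.

```python
import timeit


def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x


def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x


def best_time(func, arg, repeat=3, number=1):
    """Hypothetical helper: best (minimum) time in seconds for one call of func(arg)."""
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    print(f'{"n":>4s}  {"n ** 4":>10s}  {"flattened":>13s}  {"nested4":>13s}')
    for n in (10, 20, 30):
        # Both functions execute n ** 4 increments in total.
        t_flat = best_time(flattened_loop, n ** 4)
        t_nest = best_time(nested4_loop, n)
        print(f"{n:4}  {n ** 4:10}  {t_flat * 1e3:10.3f} ms  {t_nest * 1e3:10.3f} ms")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; the absolute numbers will of course differ from machine to machine.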

Is the first one running more bytecode? No. We can clearly see this through dis:

flattened_loop()

import dis


dis.dis(flattened_loop)

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (x)

  3           4 SETUP_LOOP              24 (to 30)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_FAST                0 (n)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                12 (to 28)
             16 STORE_FAST               2 (i)

  4          18 LOAD_FAST                1 (x)
             20 LOAD_CONST               2 (1)
             22 INPLACE_ADD
             24 STORE_FAST               1 (x)
             26 JUMP_ABSOLUTE           14
        >>   28 POP_BLOCK

  5     >>   30 LOAD_FAST                1 (x)
             32 RETURN_VALUE

nested4_loop()

dis.dis(nested4_loop)

  9           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (x)

 10           4 SETUP_LOOP              78 (to 84)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_FAST                0 (n)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                66 (to 82)
             16 STORE_FAST               2 (i)

 11          18 SETUP_LOOP              60 (to 80)
             20 LOAD_GLOBAL              0 (range)
             22 LOAD_FAST                0 (n)
             24 CALL_FUNCTION            1
             26 GET_ITER
        >>   28 FOR_ITER                48 (to 78)
             30 STORE_FAST               3 (j)

 12          32 SETUP_LOOP              42 (to 76)
             34 LOAD_GLOBAL              0 (range)
             36 LOAD_FAST                0 (n)
             38 CALL_FUNCTION            1
             40 GET_ITER
        >>   42 FOR_ITER                30 (to 74)
             44 STORE_FAST               4 (k)

 13          46 SETUP_LOOP              24 (to 72)
             48 LOAD_GLOBAL              0 (range)
             50 LOAD_FAST                0 (n)
             52 CALL_FUNCTION            1
             54 GET_ITER
        >>   56 FOR_ITER                12 (to 70)
             58 STORE_FAST               5 (l)

 14          60 LOAD_FAST                1 (x)
             62 LOAD_CONST               2 (1)
             64 INPLACE_ADD
             66 STORE_FAST               1 (x)
             68 JUMP_ABSOLUTE           56
        >>   70 POP_BLOCK
        >>   72 JUMP_ABSOLUTE           42
        >>   74 POP_BLOCK
        >>   76 JUMP_ABSOLUTE           28
        >>   78 POP_BLOCK
        >>   80 JUMP_ABSOLUTE           14
        >>   82 POP_BLOCK

 15     >>   84 LOAD_FAST                1 (x)
             86 RETURN_VALUE
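The two listings can also be compared programmatically. The sketch below counts the FOR_ITER instructions in each function and then works out how many times FOR_ITER would execute in total, assuming each loop runs it once per item plus once more to detect exhaustion. opcode_counts and for_iter_executions are names invented for this sketch, and other opcode names (such as SETUP_LOOP) vary between CPython versions.

```python
import dis
from collections import Counter


def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x


def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x


def opcode_counts(func):
    """Hypothetical helper: how often each opcode appears in func's bytecode."""
    return Counter(instr.opname for instr in dis.get_instructions(func))


def for_iter_executions(n, depth):
    """Hypothetical helper: total FOR_ITER executions for `depth` nested range(n) loops.

    The loop at nesting level d is entered n ** (d - 1) times, and each entry
    runs FOR_ITER n + 1 times (n items plus the final exhausted call).
    """
    return sum(n ** (d - 1) * (n + 1) for d in range(1, depth + 1))


if __name__ == "__main__":
    print(opcode_counts(flattened_loop)["FOR_ITER"])  # static count, flat version
    print(opcode_counts(nested4_loop)["FOR_ITER"])    # static count, nested version
    print(for_iter_executions(100 ** 4, 1))           # dynamic count, flat version
    print(for_iter_executions(100, 4))                # dynamic count, nested version
```

Under that assumption the nested version executes slightly more loop bytecode than the flat one (about 2% more FOR_ITER executions for n = 100), so the bytecode count cannot explain why it is faster.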

Are the numbers in the single loop getting too big? No.

import sys


print([(n, sys.getsizeof(n), n ** 4, sys.getsizeof(n ** 4)) for n in (10, 110)])
# [(10, 28, 10000, 28), (110, 28, 146410000, 28)]

Is the power operation (not timed in my code, but timed in yours) explaining the timing difference? No. (It is timed only once here because constant computations get cached in Python.)

%timeit -r1 -n1 100 ** 4
# loop, best of 1: 708 ns per loop

So, what is happening?

At this point this is just speculation but, given that we have ruled out at least some of the potential candidates, I believe that this is due to some caching mechanism at work in the nested version. Such caching is probably in place to speed up the notoriously slow explicit looping.
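One caching mechanism that is known to exist in CPython is the small-integer cache: integers in a small range (-5 to 256 in the CPython versions discussed here) are preallocated and reused, while larger values get a fresh object each time. Whether this accounts for the measured gap is an assumption, not something established above, but the cache itself is easy to observe. same_objects is a helper name made up for this sketch, and the behaviour is a CPython implementation detail.

```python
def same_objects(start, stop):
    """Hypothetical helper: do two independent range iterations yield identical objects?"""
    first = list(range(start, stop))
    second = list(range(start, stop))
    # `is` compares object identity, not value.
    return all(a is b for a, b in zip(first, second))


if __name__ == "__main__":
    # Values below 100, as seen by every loop variable in the nested version.
    print(same_objects(0, 100))
    # Large values, as seen by most iterations of the flattened version.
    print(same_objects(10 ** 6, 10 ** 6 + 100))
```

On CPython the first call reports that the objects are shared and the second that they are not, which means the flattened loop has to create (and release) an integer object for nearly every iteration while the nested loops do not.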

If we repeat the same test with Numba compilation (where loops get lifted, i.e. executed without the boilerplate required by Python to ensure its dynamism), we actually get:

import numba as nb


@nb.jit
def flattened_loop_nb(n):
    x = 0
    for i in range(n):
        x += 1
    return x


@nb.jit
def nested4_loop_nb(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x


flattened_loop_nb(100)  # trigger compilation
nested4_loop_nb(100)  # trigger compilation


print(f'{"n":>4s}  {"n ** 4":>16s}  {"flattened":>18s}  {"nested4":>18s}')
for n in range(10, 120, 10):
    m = n ** 4
    t1 = %timeit -q -o flattened_loop_nb(m)
    t2 = %timeit -q -o nested4_loop_nb(n)
    print(f'{n:4}  {n**4:16}  {t1.best * 1e6:15.3f} µs  {t2.best * 1e6:15.3f} µs')

   n            n ** 4           flattened             nested4
  10             10000            0.195 µs            0.199 µs
  20            160000            0.196 µs            0.201 µs
  30            810000            0.196 µs            0.200 µs
  40           2560000            0.195 µs            0.197 µs
  50           6250000            0.193 µs            0.199 µs
  60          12960000            0.195 µs            0.199 µs
  70          24010000            0.197 µs            0.200 µs
  80          40960000            0.195 µs            0.199 µs
  90          65610000            0.194 µs            0.197 µs
 100         100000000            0.195 µs            0.199 µs
 110         146410000            0.194 µs            0.199 µs

That is, slightly slower (but largely independent of n) execution for the nested loops, as expected.

Answer 2

2.43s is certainly a bit too small compared to 18.22s. Surprisingly, the nested loop does seem to be slightly faster on my machine every time. The cause could be machine dependent.

Another possible reason could be that in the first loop very large numbers have to be assigned to the iteration variable, while in the second loop they all stay below 100. The two given loops run in 4.2s and 3.9s on my machine; adding a zero, for 1e9 iterations, they take 43.3s and 39.2s respectively.

Answer 3

I ran the script you provided with both Python 2.x and 3.x multiple times. It surprised me too that the nested loop consistently completed faster. Here is what I think causes this:

When you run your Python script, the operating system you are running on assigns a PID to it. The process can be interrupted by system calls and its priority can change over time. But the system is not likely to take resources away from a process while it is changing memory addresses or values. When you run the flat for loop, it assigns far fewer variables than the nested loop. (I also tried versions with an assignment statement in the loop body; their results are much closer.) So we can say that the nested loop utilizes resources more than the flat loop if they are available. If they are not (try them in constrained Docker containers), the flat loop will be faster.
