Why does a nested loop perform much faster than the flattened one?
UPDATE
Sorry guys, but the previous number is INACCURATE. When I tested the previous code I used tqdm to see the expected time, and that function hurts performance when the iterable object is very long. That is how I got 18.22s, which is about 7.5 times as long as 2.43s.
HOWEVER, the conclusion is CONSISTENT: the nested loop is much FASTER. When the iteration count is 100^5, the difference is significant: 321.49 vs 210.05. That is a gap of about 1.53 times. Generally, we don't face this kind of long iteration; I'm just curious to know the reason for this anomalous situation.
My Python version is 3.7.3. I use a 13-inch MacBook Pro (2019) with a 2.4 GHz Intel Core i5. The OS is macOS Mojave 10.14.6.
I found a weird situation in Python, as follows:
import time
start = time.time()
# flattened loop
for i in range(100**4):
    pass
print(time.time() - start) # 18.22 (Wrong! Should be 3.09)
# nested loop
start = time.time()
for i in range(100):
    for j in range(100):
        for k in range(100):
            for l in range(100):
                pass
print(time.time() - start) # 2.43
The two kinds of loops above have the same number of iterations. However, the nested loop (2.43s) runs much faster than the flattened one (18.22s). The difference grows as the number of iterations increases. How does this happen?
Firstly, that is not a reliable way of measuring code execution. Let us consider this code instead (to be run in IPython), which does not include the power calculation in the loop, and has some computation just to make sure that it cannot be "optimized" to "please do nothing".
def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x
def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x
print(f'{"n":>4s} {"n ** 4":>16s} {"flattened":>18s} {"nested4":>18s}')
for n in range(10, 120, 10):
    m = n ** 4  # the flattened loop must run n ** 4 iterations to match the nested one
    t1 = %timeit -q -o flattened_loop(m)
    t2 = %timeit -q -o nested4_loop(n)
    print(f'{n:4} {n**4:16} {t1.best * 1e3:15.3f} ms {t2.best * 1e3:15.3f} ms')
   n           n ** 4          flattened            nested4
  10            10000           0.526 ms           0.653 ms
  20           160000           8.561 ms           8.459 ms
  30           810000          43.077 ms          39.417 ms
  40          2560000         136.709 ms         121.422 ms
  50          6250000         331.748 ms         291.132 ms
  60         12960000         698.014 ms         599.228 ms
  70         24010000        1280.681 ms        1081.062 ms
  80         40960000        2187.500 ms        1826.629 ms
  90         65610000        3500.463 ms        2942.909 ms
 100        100000000        5349.721 ms        4437.965 ms
 110        146410000        7835.733 ms        6474.588 ms
which shows that a difference does exist and that it is larger for larger n.
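The `%timeit` magic used above only works inside IPython. As a rough equivalent for a plain Python script, here is a minimal sketch built on the standard `timeit` module. The helper name `best_time` and the small values of n are my own choices for illustration, not part of the benchmark above.

```python
import timeit


def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x


def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x


def best_time(func, arg, repeat=3):
    """Best wall-clock time in seconds over `repeat` single calls (hypothetical helper)."""
    return min(timeit.repeat(lambda: func(arg), number=1, repeat=repeat))


# Small n keeps this quick; extend the tuple to compare larger sizes.
for n in (10, 20, 30):
    m = n ** 4  # computed outside the timed call
    t1 = best_time(flattened_loop, m)
    t2 = best_time(nested4_loop, n)
    print(f'{n:4} {m:16} {t1 * 1e3:15.3f} ms {t2 * 1e3:15.3f} ms')
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; the absolute numbers will of course differ from machine to machine.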
Is the first one running more bytecode? No. We can clearly see this through dis:
flattened_loop()
import dis
dis.dis(flattened_loop)
  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (x)
  3           4 SETUP_LOOP              24 (to 30)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_FAST                0 (n)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                12 (to 28)
             16 STORE_FAST               2 (i)
  4          18 LOAD_FAST                1 (x)
             20 LOAD_CONST               2 (1)
             22 INPLACE_ADD
             24 STORE_FAST               1 (x)
             26 JUMP_ABSOLUTE           14
        >>   28 POP_BLOCK
  5     >>   30 LOAD_FAST                1 (x)
             32 RETURN_VALUE
nested4_loop()
dis.dis(nested4_loop)
  9           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (x)
 10           4 SETUP_LOOP              78 (to 84)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_FAST                0 (n)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                66 (to 82)
             16 STORE_FAST               2 (i)
 11          18 SETUP_LOOP              60 (to 80)
             20 LOAD_GLOBAL              0 (range)
             22 LOAD_FAST                0 (n)
             24 CALL_FUNCTION            1
             26 GET_ITER
        >>   28 FOR_ITER                48 (to 78)
             30 STORE_FAST               3 (j)
 12          32 SETUP_LOOP              42 (to 76)
             34 LOAD_GLOBAL              0 (range)
             36 LOAD_FAST                0 (n)
             38 CALL_FUNCTION            1
             40 GET_ITER
        >>   42 FOR_ITER                30 (to 74)
             44 STORE_FAST               4 (k)
 13          46 SETUP_LOOP              24 (to 72)
             48 LOAD_GLOBAL              0 (range)
             50 LOAD_FAST                0 (n)
             52 CALL_FUNCTION            1
             54 GET_ITER
        >>   56 FOR_ITER                12 (to 70)
             58 STORE_FAST               5 (l)
 14          60 LOAD_FAST                1 (x)
             62 LOAD_CONST               2 (1)
             64 INPLACE_ADD
             66 STORE_FAST               1 (x)
             68 JUMP_ABSOLUTE           56
        >>   70 POP_BLOCK
        >>   72 JUMP_ABSOLUTE           42
        >>   74 POP_BLOCK
        >>   76 JUMP_ABSOLUTE           28
        >>   78 POP_BLOCK
        >>   80 JUMP_ABSOLUTE           14
        >>   82 POP_BLOCK
 15     >>   84 LOAD_FAST                1 (x)
             86 RETURN_VALUE
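To compare the two listings programmatically rather than by eye, one option is to count the opcodes in each function. This is a minimal sketch assuming the same two functions as above; `opcode_histogram` is a hypothetical helper, and the exact opcode names and counts vary between CPython versions.

```python
import dis
from collections import Counter


def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x


def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x


def opcode_histogram(func):
    """Count how often each opcode name appears in func's bytecode (hypothetical helper)."""
    return Counter(instr.opname for instr in dis.get_instructions(func))


flat = opcode_histogram(flattened_loop)
nested = opcode_histogram(nested4_loop)
print("flattened:", sum(flat.values()), "instructions")
print("nested4:  ", sum(nested.values()), "instructions")
# Opcodes that occur more often in the nested version (loop setup and teardown).
print(nested - flat)
```

On Python 3.7 the totals should match the 17 and 44 instructions listed above. The nested version only adds loop setup and teardown around the same innermost body, so the flattened version is not executing extra bytecode per iteration.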
Are the numbers in the single loop getting too big? No.
import sys
print([(n, sys.getsizeof(n), n ** 4, sys.getsizeof(n ** 4)) for n in (10, 110)])
# [(10, 28, 10000, 28), (110, 28, 146410000, 28)]
Is the power operation (not timed in my code, but timed in yours) explaining the timing difference (timed only once because constant computations get cached in Python)? No.
%timeit -r1 -n1 100 ** 4
# loop, best of 1: 708 ns per loop
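The remark about constant computations being cached can be checked directly: CPython folds a constant expression such as `100 ** 4` at compile time, so the power is not recomputed at run time. A small sketch, assuming CPython (other implementations may behave differently):

```python
import dis

# Compile the expression on its own and inspect the constants stored in the code object.
code = compile("100 ** 4", "<example>", "eval")
print(code.co_consts)  # on CPython this is expected to contain 100000000

# The disassembly should load the precomputed constant; no power instruction should appear.
dis.dis(code)
```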
So, what is happening?
At this point this is just speculation, but, given that we have ruled out at least some of the potential candidates, I believe that this is due to some caching mechanism that is taking place in the nested version. Such caching is probably in place to speed up the notoriously comparatively slow explicit looping.
If we repeat the same test with Numba compilation (where loops get lifted, i.e. executed without the boilerplate required by Python to ensure its dynamism), we do actually get:
import numba as nb
@nb.jit
def flattened_loop_nb(n):
    x = 0
    for i in range(n):
        x += 1
    return x
@nb.jit
def nested4_loop_nb(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x
flattened_loop_nb(100) # trigger compilation
nested4_loop_nb(100) # trigger compilation
print(f'{"n":>4s} {"n ** 4":>16s} {"flattened":>18s} {"nested4":>18s}')
for n in range(10, 120, 10):
    m = n ** 4
    t1 = %timeit -q -o flattened_loop_nb(m)
    t2 = %timeit -q -o nested4_loop_nb(n)
    print(f'{n:4} {n**4:16} {t1.best * 1e6:15.3f} µs {t2.best * 1e6:15.3f} µs')
n n ** 4 flattened nested4
10 10000 0.195 µs 0.199 µs
20 160000 0.196 µs 0.201 µs
30 810000 0.196 µs 0.200 µs
40 2560000 0.195 µs 0.197 µs
50 6250000 0.193 µs 0.199 µs
60 12960000 0.195 µs 0.199 µs
70 24010000 0.197 µs 0.200 µs
80 40960000 0.195 µs 0.199 µs
90 65610000 0.194 µs 0.197 µs
100 100000000 0.195 µs 0.199 µs
110 146410000 0.194 µs 0.199 µs
That is, a slightly slower (but largely independent of n) execution speed for the nested loops, as expected.
2.43s is certainly a bit too small compared to 18.22s. Surprisingly, the nested loop does seem to be slightly faster on my machine all the time. The cause could be that it's machine dependent.
Another possible reason could be that in the first loop, very large numbers have to be assigned to the iterating variable, while in the second loop they are all below 100. The given two loops run in 4.2s and 3.9s on my machine; adding a zero to reach 1e9 iterations, they take 43.3s and 39.2s respectively.
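One way to probe this idea: CPython keeps a preallocated cache of small integers (from -5 to 256), so loop variables below 100 can reuse existing objects, whereas larger values need a fresh int object. Whether that accounts for the gap measured above is not established here. A minimal sketch, assuming CPython; `is_shared` is a hypothetical helper:

```python
import sys


def is_shared(value):
    """Return True if two independently built ints with this value are the same object."""
    a = int(str(value))  # built at run time, so no compile-time constant sharing
    b = int(str(value))
    return a is b


for value in (99, 256, 257, 100 ** 4):
    print(value, is_shared(value))

# On CPython, values inside the small-int cache (-5..256) are expected to be shared,
# while larger values are expected to be distinct objects.
```

This is an implementation detail of CPython, not a language guarantee, so code should never rely on `is` for comparing integers.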
I ran the script you provided with both Python 2.x and 3.x multiple times. It surprised me too that the nested loop consistently completed faster. Here is what I think causes this issue:
When you run your Python script, the operating system you are running on assigns a PID to it. The process can be interrupted by system calls and its priority can change over time. But the system is not likely to take resources away from a process while it is changing memory addresses or values. When you run the flat for loop, it assigns far fewer variables than the nested loop (I also tried versions with an assignment statement in the loop body; their results are much closer). So we can say that the nested loop utilizes resources more than the flat loop, if they are available. If they are not (try them in constrained Docker containers), the flat loop will be faster.