Python list prepend time complexity

Question

Why is this code

res = []
for i in range(10000):
    res[0:0] = [i]

about ten times faster than this code?

res = []
for i in range(10000):
    res = [i] + res

I expected that both would have to move all the existing list elements to put the new integer at index zero. Both do indeed appear to be O(n^2) as the range grows, but the slice assignment is far faster than the concatenation, implying that the latter performs roughly ten times as many fundamental operations.

(Yes, both are inefficient ways to achieve this result; it would be better to use a deque, or to append and then reverse the result.)
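For anyone who wants to reproduce the comparison, here is a minimal timing sketch. The function names are made up for illustration, and the absolute numbers (and the exact ratio) will depend on the machine and the Python version, so treat the output as indicative only. It also includes the two alternatives mentioned above.

```python
import timeit
from collections import deque


def prepend_slice(n):
    # Prepend by slice assignment: mutates the same list in place.
    res = []
    for i in range(n):
        res[0:0] = [i]
    return res


def prepend_concat(n):
    # Prepend by concatenation: builds a brand new list on every iteration.
    res = []
    for i in range(n):
        res = [i] + res
    return res


def prepend_deque(n):
    # deque.appendleft is O(1), so the whole loop is O(n).
    res = deque()
    for i in range(n):
        res.appendleft(i)
    return list(res)


def append_then_reverse(n):
    # Append is amortized O(1); a single reverse at the end is O(n).
    res = []
    for i in range(n):
        res.append(i)
    res.reverse()
    return res


if __name__ == "__main__":
    n = 10000
    runs = 3
    for fn in (prepend_slice, prepend_concat, prepend_deque, append_then_reverse):
        seconds = timeit.timeit(lambda: fn(n), number=runs)
        print(f"{fn.__name__:20s} {seconds / runs:.4f} s per run")
```

All four functions build the same list; only the cost differs.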

Answer 1

You're right that, at a high level, the loops are computing essentially the same results in the same way. So timing differences are due to implementation details of the Python version in use. There is no property of the language that accounts for the difference.

In the python.org C implementation (CPython), the code is in fact quite different deep under the covers.

res[0:0] = [i]

does what it looks like it does ;-) The entire content of res is shifted right by a slot, and i is plugged into the hole created at the left end. The vast bulk of the time is consumed by a single call to the platform C library's memmove() function, which does the shifting in one mass gulp. Modern hardware and C libraries are very good at moving contiguous slices of memory (which, at the C level, a Python list object is) quickly.

res = [i] + res

does much more under the covers, primarily due to CPython's reference-counting. It's more like:

create a brand new list object
stuff `i` into it
for each element of `res`, which is a pointer to an int object:
    copy the pointer into the new list object
    dereference the pointer to load the int object's refcount
    increment the refcount
    store the new refcount back into the int object
bind the name `res` to the new list object
decrement the refcount on the old `res` object
at which point the old res's refcount becomes 0 so it's trash
so for each object in the old res:
    dereference the pointer to load the int object's refcount
    decrement the refcount
    store the new refcount back into the int object
    check to see whether the new refcount is zero
    take the "no, it isn't zero" branch
release the memory for the old list object

A lot more raw work, and all that pointer dereferencing can leap all over memory, which isn't cache-friendly.
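The refcount bookkeeping can be observed from Python with sys.getrefcount. This is only a rough sketch, assuming CPython: it shows the net refcount at a few points, not every transient increment and decrement. A fresh object() is used as the element because small ints may be immortal in recent CPython versions, in which case their reported refcount would not change.

```python
import sys

marker = object()            # fresh object, so its refcount is meaningful
res = [marker]
base = sys.getrefcount(marker)

# Slice-assignment prepend: marker is shifted, its refcount is untouched.
res[0:0] = [0]
after_slice = sys.getrefcount(marker)

# Concatenation: the new list holds its own reference to marker.
old = res
res = [0] + res
while_both_alive = sys.getrefcount(marker)

# Dropping the old list releases its reference again.
del old
after_old_freed = sys.getrefcount(marker)

print(base, after_slice, while_both_alive, after_old_freed)
```

In the original loop the old list is released immediately by the rebinding of res, so the increment and the matching decrement both happen on every iteration, for every element.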

The implementation of

res[0:0] = [i]

skips most of that: it knows from the start that merely shifting the position of res's contents can't make any net change to the shifted objects' refcounts, so it doesn't bother to increment or decrement any of those refcounts. The C-level memmove() is pretty much the whole ball of wax, and none of the pointers to int objects need to be dereferenced. Not only less raw work, but also very cache-friendly.
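The in-place versus new-object difference is also visible at the Python level. A small sketch (the variable names are just for illustration): a second name bound to the list sees the slice-assignment prepend, but not the concatenation, because the latter rebinds res to a different list object.

```python
res = [1, 2, 3]
alias = res                  # second name for the same list object

res[0:0] = [0]               # in place: the existing list is modified
same_after_slice = alias is res
alias_after_slice = list(alias)

res = [-1] + res             # new list: res is rebound, alias is not
same_after_concat = alias is res

print(same_after_slice, alias_after_slice)
print(same_after_concat, alias, res)
```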

Answer 2

Disassembling the relevant line of each example gives the following bytecode:

res[0:0] = [i]

  4          25 LOAD_FAST                1 (i)
             28 BUILD_LIST               1
             31 LOAD_FAST                0 (res)
             34 LOAD_CONST               2 (0)
             37 LOAD_CONST               2 (0)
             40 BUILD_SLICE              2
             43 STORE_SUBSCR

res = [i] + res

  4          25 LOAD_FAST                1 (i)
             28 BUILD_LIST               1
             31 LOAD_FAST                0 (res)
             34 BINARY_ADD
             35 STORE_FAST               0 (res)

In the first example (slice assignment) there is no BINARY_ADD, only a store operation. The version with the addition performs both a BINARY_ADD and a store, and BINARY_ADD does considerably more work, which is likely why it is so much slower. The slice notation does require building the slice object, but those operations are very simple.

For a fairer comparison, we can preconstruct the slice and store it (using something like s = slice(0, 0)), so that the slice notation becomes a plain lookup; the resulting bytecode looks like this:

res[s] = [i]

  4          25 LOAD_FAST                1 (i)
             28 BUILD_LIST               1
             31 LOAD_FAST                0 (res)
             34 LOAD_GLOBAL              1 (s)
             37 STORE_SUBSCR

That leaves both versions with the same number of bytecode instructions, but this one consists only of load and store instructions, while the version with + effectively needs the extra BINARY_ADD operation.
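To get the listings for your own interpreter, something like the sketch below should work. The two wrapper functions and the opnames helper are hypothetical names used only for this example. Opcode names and offsets vary between CPython versions: newer releases use a generic BINARY_OP in place of BINARY_ADD and may emit a different store opcode for slice assignment, so the output will not match the listings above exactly.

```python
import dis


def prepend_slice(res, i):
    res[0:0] = [i]


def prepend_concat(res, i):
    res = [i] + res
    return res


def opnames(func):
    # Collect just the opcode names, ignoring offsets and arguments.
    return [ins.opname for ins in dis.get_instructions(func)]


slice_ops = opnames(prepend_slice)
concat_ops = opnames(prepend_concat)

print("slice :", slice_ops)
print("concat:", concat_ops)

# Full listings, in the same style as the ones shown in this answer.
dis.dis(prepend_slice)
dis.dis(prepend_concat)
```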

solution1
3 2017-12-27 17:35:28

solution2
1 2017-12-27 07:31:41
