
Why such dramatic impacts from using indexes in Python loops?

I'm writing Python courseware and am teaching about nested for loops. I have four homegrown versions of itertools.combinations(iterable, 2) to compare, as follows:

combos.py:

import itertools

# baseline to compare
def combos_v1(iterable):
    return itertools.combinations(iterable, 2)

def combos_v2(iterable):
    a = []

    for elem1 in iterable[:-1]:
        iterable.pop(0)

        for elem2 in iterable:
            a.append((elem1, elem2))

    return a

def combos_v3(iterable):
    a = []

    for idx1, elem1 in enumerate(iterable[:-1]):
        for elem2 in iterable[idx1 + 1:]:
            a.append((elem1, elem2))

    return a

def combos_v4(iterable):
    a = []
    length = len(iterable)

    for idx1, elem1 in enumerate(iterable[:-1]):
        for idx2 in range(idx1 + 1, length):
            a.append((elem1, iterable[idx2]))

    return a

def combos_v5(iterable):
    a = []
    length = len(iterable)

    for idx1 in range(length - 1):
        for idx2 in range(idx1 + 1, length):
            a.append((iterable[idx1], iterable[idx2]))

    return a

I was prepared for the last three versions to be slower than the first two, but not over 1,000 times slower:

time_combos.py:

from timeit import timeit
times = []

for ver in range(1, 6):
    times.append(timeit(
        f"combos_v{ver}(it)",
        f"from combos import combos_v{ver}; it = [42]*100",
        number=10000))

print("Results:")

for ver, time in enumerate(times, start=1):
    print(f"v{ver}: {time} secs")

Output:

Results:
v1: 0.011828199999999999 secs
v2: 0.003176400000000003 secs
v3: 4.159025300000001 secs
v4: 4.5762011000000005 secs
v5: 5.252337500000001 secs

What's happening under the hood that makes the versions that use indexes in some way so much slower than combos_v2? I actually thought combos_v3 was going to be the slowest because it copies the list in each iteration of the outer loop, but it's not significantly different from the non-copying versions.

EDIT

I suspect the answer is not because of indexing. Instead, I think the slowdown is from Python's use of unlimited-precision integers. The _v3, _v4, and _v5 options all either explicitly or implicitly (in enumerate()) do integer math, which I believe is more expensive with unlimited-precision integers than it is with CPU-native integers. I don't have a good way of testing this, though. I can verify that iterating over a range is slower than iterating over a list:

>>> timeit("for __ in range(10000): pass", number=10000)
1.565418199999982
>>> timeit("for __ in it: pass", "it = list(range(10000))", number=10000)
0.6037281999999777

but it's not thousands of times slower, so I'm not sure whether that accounts for what I'm seeing. I can compare iterating over a range vs. indexing without integer math:

>>> timeit("for el in range(10000): el", number=10000)
1.8875144000000432
>>> timeit("for __ in it: it[42]", "it = list(range(10000))", number=10000)
2.2635267

but I think this is an apples-to-oranges comparison. I tried rewriting some of the slow versions using numpy.uint64 integers and they were marginally faster, but I don't know how much overhead I'm incurring from numpy.

This answer was written with two guiding principles:

  • When you want to answer performance questions, you need to play a game of "find the hot loop".
  • Notice when you are confused: it means that something you believe is wrong.

To start with, let's look at your testing code:

for ver in range(1, 6):
    times.append(timeit(
        f"combos_v{ver}(it)",
        f"from combos import combos_v{ver}; it = [42]*100",
        number=10000))

Your testing code runs combos_vn(it) with the same list object, 10000 times.

combos_v1

def combos_v1(iterable):
    return itertools.combinations(iterable, 2)

dis.dis tells us:

  2           0 LOAD_GLOBAL              0 (itertools)
              2 LOAD_METHOD              1 (combinations)
              4 LOAD_FAST                0 (iterable)
              6 LOAD_CONST               1 (2)
              8 CALL_METHOD              2
             10 RETURN_VALUE

Each of these opcodes only runs once, so it's unlikely they contribute significantly to the execution time. The hot loop is in itertools.combinations; let's see how it's implemented. (I've cut bits out of this code for readability; what's left is more like pseudocode.)

typedef struct {
    PyObject_HEAD
    PyObject *pool;         /* input converted to a tuple */
    Py_ssize_t *indices;    /* one index per result element */
    PyObject *result;       /* most recently returned result tuple */
    Py_ssize_t r;           /* size of result tuple */
    int stopped;            /* set to 1 when the iterator is exhausted */
} combinationsobject;

// in itertools.combinations.__new__
    pool = PySequence_Tuple(iterable);
    if (pool == NULL)
        goto error;

// in itertools.combinations.__next__
    if (result == NULL) {
        /* On the first pass, initialize result tuple using the indices */
        for (i=0; i<r ; i++) {
            index = indices[i];
            elem = PyTuple_GET_ITEM(pool, index);
            Py_INCREF(elem);
            PyTuple_SET_ITEM(result, i, elem);
        }
    } else {
        /* Copy the previous result tuple or re-use it if available */
        if (Py_REFCNT(result) > 1) {
            result = _PyTuple_FromArray(_PyTuple_ITEMS(old_result), r);
            Py_DECREF(old_result);
        }

        /* Now, we've got the only copy so we can update it in-place
         * CPython's empty tuple is a singleton and cached in
         * PyTuple's freelist.
         */
        assert(r == 0 || Py_REFCNT(result) == 1);

        /* Scan indices right-to-left until finding one that is not
           at its maximum (i + n - r). */
        for (i=r-1 ; i >= 0 && indices[i] == i+n-r ; i--)
            ;

        /* If i is negative, then the indices are all at
           their maximum value and we're done. */
        if (i < 0)
            goto empty;

        /* Increment the current index which we know is not at its
           maximum.  Then move back to the right setting each index
           to its lowest possible value (one higher than the index
           to its left -- this maintains the sort order invariant). */
        indices[i]++;
        for (j=i+1 ; j<r ; j++)
            indices[j] = indices[j-1] + 1;

        /* Update the result tuple for the new indices
           starting with i, the leftmost index that changed */
        for ( ; i<r ; i++) {
            index = indices[i];
            elem = PyTuple_GET_ITEM(pool, index);
            Py_INCREF(elem);
            oldelem = PyTuple_GET_ITEM(result, i);
            PyTuple_SET_ITEM(result, i, elem);
            Py_DECREF(oldelem);
        }

We see that itertools.combinations is lazy; it returns an iterator, and does most of its work in the __next__ method of that iterator.

Your testing code runs combos_vn(it) with the same list object, 10000 times.

combos_v1 runs tuple(iterable) in preparation, but since you never iterate over its return value, the combinations don't get calculated. That's why it's so fast: it's only looping over the iterable once (in __new__)! This isn't a fair comparison.
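Here's a small interactive check (not part of the original timings; the object address is elided) showing that almost no work happens until the iterator is consumed:

>>> import itertools
>>> c = itertools.combinations(range(100), 2)
>>> c                      # no pairs have been computed yet
<itertools.combinations object at 0x...>
>>> next(c)                # the work happens one pair at a time, in __next__
(0, 1)
>>> len(list(c))           # consuming the rest produces the remaining 4949 pairs
4949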

combos_v2

def combos_v2(iterable):
    a = []

    for elem1 in iterable[:-1]:
        iterable.pop(0)

        for elem2 in iterable:
            a.append((elem1, elem2))

    return a

Let's try running this and see what happens.

>>> l = [3, 4, 5]
>>> combos_v2(l)
[(3, 4), (3, 5), (4, 5)]
>>> l
[5]
>>> combos_v2(l)
[]
>>> l
[5]

What happened? Well, iterable.pop(0) happened. This method removes the first item from the list; there's no code to add items back once it's done.

Your testing code runs combos_vn(it) with the same list object, 10000 times.

You've measured the time of 1 call with a 100-element list, and 9999 calls with a 1-element list. This isn't a fair comparison, either.
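One rough way to confirm this (a sketch, not a precise benchmark, since it also times the list construction) is to build a fresh list inside the timed statement:

from timeit import timeit

# Rebuild the 100-element list on every call so the pop(0) calls can't
# carry over between iterations; the construction cost is included.
print(timeit("combos_v2([42] * 100)",
             "from combos import combos_v2",
             number=10000))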

combos_v3

def combos_v3(iterable):
    a = []

    for idx1, elem1 in enumerate(iterable[:-1]):
        for elem2 in iterable[idx1 + 1:]:
            a.append((elem1, elem2))

    return a

I ran dis.dis again; here are the interesting parts:

              6 LOAD_FAST                0 (iterable)
              8 LOAD_CONST               0 (None)
             10 LOAD_CONST               1 (-1)
             12 BUILD_SLICE              2
             14 BINARY_SUBSCR

This corresponds to the code iterable[:-1]. Let's have a look at the implementation of BINARY_SUBSCR for a list and a slice:

static PyObject *
list_slice(PyListObject *a, Py_ssize_t ilow, Py_ssize_t ihigh)
{
    PyListObject *np;
    PyObject **src, **dest;
    Py_ssize_t i, len;
    len = ihigh - ilow;
    if (len <= 0) {
        return PyList_New(0);
    }
    np = (PyListObject *) list_new_prealloc(len);
    if (np == NULL)
        return NULL;

    src = a->ob_item + ilow;
    dest = np->ob_item;
    for (i = 0; i < len; i++) {
        PyObject *v = src[i];
        Py_INCREF(v);
        dest[i] = v;
    }
    Py_SET_SIZE(np, len);
    return (PyObject *)np;
}

It might not look like much, but this little iterable[:-1] is performing an allocation and a copy of 99 elements. The copy's not much – that's only an O(n) loop, not an O(n²) – but it's still something; and allocating memory has a chance of getting the operating system involved, which can be a really expensive operation. (If there's not enough free memory, the OS has to start swapping to disk…)
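If you want a rough feel for what that one slice costs on your machine, you can time it in isolation (numbers omitted here; this is just a sketch):

from timeit import timeit

# One allocation plus a copy of 99 object pointers per call.
print(timeit("l[:-1]", "l = [42] * 100"))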

  5          28 LOAD_FAST                0 (iterable)
             30 LOAD_FAST                2 (idx1)
             32 LOAD_CONST               2 (1)
             34 BINARY_ADD
             36 LOAD_CONST               0 (None)
             38 BUILD_SLICE              2
             40 BINARY_SUBSCR

Another list slice! This one runs on every iteration of the outer loop; it's a Θ(n) cost paid n-1 times, which is Θ(n²). That's probably significant with a large list… but your list only has 100 elements.

Since the slice-created list's refcount hits 0 before the next list is created, its buffer is freed before a new (one element smaller) one needs to be allocated. Most C runtimes allocate blocks of memory from the OS, and then divide that up when the program needs it; since the memory required is quite small (99 or fewer elements), the C runtime's allocator is probably re-using the same region of pre-allocated memory, meaning the OS never gets involved. The extra copies probably don't even leave the CPU's cache!

We probably didn't need to look at the opcodes at all for this one, because that's not where the computer's spending most of its time, even in theory. To quote Donald Knuth:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

combos_v4

def combos_v4(iterable):
    a = []
    length = len(iterable)

    for idx1, elem1 in enumerate(iterable[:-1]):
        for idx2 in range(idx1 + 1, length):
            a.append((elem1, iterable[idx2]))

    return a

This one is probably worth thinking about at the bytecode level. Compared to combos_v3, combos_v4 drops the list slice inside the outer loop, so it is left with just the one extra Θ(n)-sized allocation (the iterable[:-1] copy). In exchange, though, it runs iterable[idx2] Θ(n²) times.

Let's just do an isolated comparison of those two:

>>> from timeit import timeit
>>> timeit('for item in l: item', 'l = list(range(100))')
1.4886210270342417
>>> timeit('for i in range(100): l[i]', 'l = list(range(100))')
3.4677056610235013
>>> timeit('for item in l: item', 'l = list(range(1000))')
14.140142865013331
>>> timeit('for i in range(1000): l[i]', 'l = list(range(1000))')
36.24445745302364

This isn't very rigorous, but it demonstrates the point: indexing into a list from a range loop takes over twice as long as iterating over the list directly. This effect could cancel out the savings from not doing that copy.

combos_v5

def combos_v5(iterable):
    a = []
    length = len(iterable)

    for idx1 in range(length - 1):
        for idx2 in range(idx1 + 1, length):
            a.append((iterable[idx1], iterable[idx2]))

    return a

If combos_v4 is slower than combos_v3 , then combos_v5 will probably be slower than both, for the same reason.

Fixed test harness

To test combos_v1 properly, we need to fix that test harness. It needs to consume the iterator; I happen to know that collections.deque has a special-cased code path for doing that as fast as possible if maxlen=0.
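To illustrate the trick (a quick interactive check, not part of the timings):

>>> from collections import deque
>>> it = iter(range(5))
>>> deque(it, maxlen=0)    # consumes the iterator without storing anything
deque([], maxlen=0)
>>> list(it)               # nothing left to yield
[]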

To test combos_v2 properly, we need to reset the list each loop. I can't work out how to do that without it contributing to the time, so I'll just make the list a tuple instead.

from collections import deque
from timeit import timeit

times = []

for ver in range(1, 17):
    try:
        times.append(timeit(
            f"deque(combos_v{ver}(it), maxlen=0)",
            f"from combos import combos_v{ver}; it = tuple([42]*100)",
            number=10000,
            globals=globals()))
    except Exception as e:
        times.append(e)

Improved versions

There are a few ways to improve combos_v2. The trivial solution is to make it copy the list at the beginning:

def combos_v7(iterable):
    a = []
    l = list(iterable)

    for elem1 in l[:-1]:
        l.pop(0)

        for elem2 in l:
            a.append((elem1, elem2))

    return a

If you do this, you'll notice that it's still quite slow. The culprit is l.pop(0). The implementation of pop is quite heavily optimised and hard to understand, so I'll paraphrase: CPython represents a list as a heap-allocated array of object pointers. When you remove the first one, all of the others have to be moved back by one slot. Repeating l.pop(0) until there's only one item left in the list therefore requires Θ(n²) moves.
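To see the asymmetry for yourself, a rough comparison (exact timings depend on your machine, so they're omitted) might look like this:

from timeit import timeit

# pop(0) shifts every remaining element left by one; pop() just shrinks the array.
print(timeit("l.pop(0)", "l = list(range(100_000))", number=50_000))
print(timeit("l.pop()",  "l = list(range(100_000))", number=50_000))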

We can fix this by reversing the list, then popping from the end:

def combos_v8(iterable):
    a = []
    l = list(reversed(iterable))

    for elem1 in reversed(l[1:]):
        l.pop(-1)

        for elem2 in reversed(l):
            a.append((elem1, elem2))

    return a
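Another way to sidestep the pop(0) cost, sketched here as a hypothetical variant that isn't in the timing tables below, is collections.deque, which pops from the left in O(1):

from collections import deque

def combos_v8b(iterable):
    a = []
    d = deque(iterable)
    while len(d) > 1:
        elem1 = d.popleft()        # O(1): no elements need to shift
        for elem2 in d:
            a.append((elem1, elem2))
    return a

(Iterating a deque tends to be a bit slower than iterating a list, so whether this actually wins depends on the sizes involved.)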

combos_v5 has a (potentially) trivial improvement, too. Instead of doing iterable[idx1] on every inner-loop iteration, only do it once per outer loop:

def combos_v9(iterable):
    a = []
    length = len(iterable)

    for idx1 in range(length - 1):
        item1 = iterable[idx1]
        for idx2 in range(idx1 + 1, length):
            a.append((item1, iterable[idx2]))

    return a

And finally, the most marginal optimisation I can think of: removing lookups for itertools.combinations in combos_v1:

from itertools import combinations
def combos_v10(iterable):
    return combinations(iterable, 2)

def combos_v11(iterable, combinations=itertools.combinations):
    return combinations(iterable, 2)

combos_v12 = (lambda c: lambda i: c(i, 2))(itertools.combinations)

from types import FunctionType, CodeType
combos_v13 = FunctionType(CodeType(1, 1, 0, 1, 3, 65, b'd\x00|\x00d\x01\x83\x02S\x00', (itertools.combinations, 2), (), ('iterable',), 'custom', 'combos_v13', 9001, b''), globals())

Other versions

itertools.islice is like a hybrid between the slice version and the range version: it does the iteration in C, with fixed-width C integers, but it doesn't need a copy operation to do so – to "remove" the elements at the beginning, it repeatedly calls next on the iterator.
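A quick illustration of that skipping behaviour:

>>> from itertools import islice
>>> l = [10, 20, 30, 40]
>>> list(islice(l, 2, None))   # the first two elements are read and discarded
[30, 40]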

from itertools import islice

def combos_v14(iterable):
    a = []
    for i, x in enumerate(iterable, start=1):
        for y in islice(iterable, i, None):
            a.append((x, y))
    return a

def combos_v15(iterable):
    return [
        (x, y)
        for i, x in enumerate(iterable, start=1)
        for y in islice(iterable, i, None)
    ]

combos_v15 is just combos_v14 with a list comprehension. I expect that to be slightly faster, for two reasons:

  • The list doesn't have to be in a valid state while it's being constructed, since it doesn't need to be exposed to Python; CPython might have an optimisation that relies on this.
  • It's not constantly looking up .append on a (see the sketch after this list).
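As a sketch of that second point (a hypothetical helper that isn't in the timing tables), the usual manual workaround is to bind the method once, outside the hot loop:

from itertools import islice

def combos_v14b(iterable):
    # Variant of combos_v14: resolve a.append once instead of on every
    # inner-loop iteration.
    a = []
    append = a.append
    for i, x in enumerate(iterable, start=1):
        for y in islice(iterable, i, None):
            append((x, y))
    return a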

itertools.islice has to iterate through the whole beginning of the list to get to the point we want. Wouldn't it be nice if we didn't have to do that? Well, I found a nice little implementation detail in the source code:

PyDoc_STRVAR(length_hint_doc, "Private method returning an estimate of len(list(it)).");
PyDoc_STRVAR(reduce_doc, "Return state information for pickling.");
PyDoc_STRVAR(setstate_doc, "Set state information for unpickling.");

static PyMethodDef listiter_methods[] = {
    {"__length_hint__", (PyCFunction)listiter_len, METH_NOARGS, length_hint_doc},
    {"__reduce__", (PyCFunction)listiter_reduce, METH_NOARGS, reduce_doc},
    {"__setstate__", (PyCFunction)listiter_setstate, METH_O, setstate_doc},
    {NULL,              NULL}           /* sentinel */
};

This is for the pickle module, which has a pure-Python version. List iterators can be pickled, which means there's a way to set their start index at runtime! Let's give it a spin.

>>> l = [3, 4, 5, 6]
>>> i = iter(l)
>>> i.__reduce__()
(<built-in function iter>, ([3, 4, 5, 6],), 0)
>>> next(i)
3
>>> i.__setstate__(0)
>>> next(i)
3
>>> i.__setstate__(2)
>>> list(i)
[5, 6]

Let's use it!

def combos_v16(iterable):
    a = []
    for i, x in enumerate(iterable, start=1):
        list_iter = iter(iterable)
        list_iter.__setstate__(i)
        for y in list_iter:
            a.append((x, y))
    return a

Unfortunately, we can't easily create a list comprehension version of this. I could do it with CPython bytecode, of course, but that's hackish even by my standards. It should probably re-use list_iter instead of creating it anew each loop; I'll add that when I re-do the answer.

We could also drop down to ctypes and try to clone the outermost list iterator. This is a horribly unsafe thing to do, and I'm not confident it'd be any faster. I might add it when I rewrite this answer.

Final comparison

These were run on Python 3.9.2 on a [INSERT ARCH HERE] machine.

100-long list (10000 repeats)

 v1: 0.5226162209874019 secs
 v2: 'tuple' object has no attribute 'pop' secs
 v3: 4.440105640038382 secs
 v4: 5.242078252020292 secs
 v5: 5.847154696006328 secs
 v6: name 'combos_v6' is not defined secs
 v7: 4.175016652967315 secs
 v8: 4.296667369024362 secs
 v9: 5.217006383987609 secs
v10: 0.4311756350216456 secs
v11: 0.4331229960080236 secs
v12: 0.4280035650008358 secs
v13: 0.4272943940013647 secs
v14: 4.696938373963349 secs
v15: 3.6629469400504604 secs
v16: 4.158298808033578 secs

1000-long list (1000 repeats)

I was doing other things on my computer while this was running, so the results are likely noisier.

 v1: 4.355944950017147 secs
 v2: 'tuple' object has no attribute 'pop' secs
 v3: 65.12864276999608 secs
 v4: 80.17008238798007 secs
 v5: 84.949459077965 secs
 v6: name 'combos_v6' is not defined secs
 v7: 69.36186740599805 secs
 v8: 64.24041850597132 secs
 v9: 77.76190268195933 secs
v10: 4.170568476023618 secs
v11: 4.219949325022753 secs
v12: 4.133982014027424 secs
v13: 4.188805791025516 secs
v14: 67.87807830097154 secs
v15: 57.55693690601038 secs
v16: 64.51734905398916 secs

10000-long list (1000 repeats)

(Crashed Python for some reason.)

These results make a lot more sense. No implementation with Python in the inner loop beats an implementation with a pure-C inner loop, but the pure-C implementations are roughly a constant factor faster; and there's not that much difference between various Python implementations. The fastest Python version uses a list comprehension.


Aside: what you're calling an iterable is actually an indexable. I haven't corrected your variable names in my answer, but I probably should.
