Fastest method for extracting sub-list from Python list given array of indexes

I have a large Python list l of objects of any type, and another large list i (or even a NumPy array) of integer indexes pointing to some elements of l.

The question is: what is the fastest (most efficient) way to create another list l2 that contains the elements of l at the indexes from i?

The easiest way is to do a list comprehension:

l2 = [l[si] for si in i]
# Use np.nditer(i) instead of i, for NumPy array case

But is this the fastest possible way?

A list comprehension is a Python-level loop, so it might be slow for large lists. Is there a built-in method in the standard library, written in efficient C, that does just this task? Or does NumPy have a method for indexing a Python list by a NumPy array?

Maybe the standard library has some simple and fast function that does the same thing as NumPy's np.take, as in the imaginary code below:

import listtools
l2 = listtools.take(l, indexes)

You can get a minor speedup (roughly 25-30% less time in the example below) by using operator.itemgetter, which supports bulk lookup:

>>> import string
>>> import random
>>> import operator as op
>>> from timeit import timeit

# create random lists
>>> l = [random.choice([*string.ascii_letters,*range(100)]) for _ in range(1000000)]
>>> i = [random.randint(0,999999) for _ in range(300000)]

# timings
>>> timeit(lambda:[l[si] for si in i],number=100)
3.0997245000035036
>>> timeit(lambda:list(map(l.__getitem__,i)),number=100)
2.892384369013598
>>> timeit(lambda:list(op.itemgetter(*i)(l)),number=100)
2.1787672539940104
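
One thing to watch with the itemgetter approach: called with a single index it returns the bare item rather than a tuple, and called with no indexes at all it raises TypeError. Below is a minimal sketch of a wrapper that smooths over both cases. The name take is hypothetical (it is not a standard-library function), and the sketch makes no claim about how the extra checks affect the timings above.

```python
import operator

def take(seq, indexes):
    """Return a list of seq[k] for each k in indexes (hypothetical helper)."""
    indexes = list(indexes)
    if not indexes:
        # operator.itemgetter() with no arguments raises TypeError
        return []
    if len(indexes) == 1:
        # itemgetter with a single key returns the item itself, not a tuple
        return [seq[indexes[0]]]
    # Bulk lookup: itemgetter returns a tuple, so convert it to a list
    return list(operator.itemgetter(*indexes)(seq))

letters = ['a', 'b', 'c', 'd', 'e']
print(take(letters, [4, 0, 2]))  # ['e', 'a', 'c']
print(take(letters, [3]))        # ['d']
print(take(letters, []))         # []
```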

NumPy arrays can also store and process arbitrary Python objects through dtype = np.object_.

So I decided to measure the speed of NumPy against plain Python. As I mentioned in my question, I also want to cover the case where the indexes are a NumPy array of integers.

The following code measures the different cases: whether or not the source lists need to be converted to NumPy arrays, and whether or not the result needs to be converted back to a list.

import string
from timeit import timeit
import numpy as np
np.random.seed(0)

letters = np.array(list(string.ascii_letters), dtype = np.object_)
nl = letters[np.random.randint(0, len(letters), size = (10 ** 6,))]
l = nl.tolist()
ni = np.random.permutation(np.arange(nl.size, dtype = np.int64))
i = ni.tolist()

pyt = timeit(lambda: [l[si] for si in i], number = 10)
print('python:', round(pyt, 3), flush = True)

for l_from_list in [True, False]:
    for i_from_list in [True, False]:
        for l_to_list in [True, False]:
            def Do():
                cl = np.array(l, dtype = np.object_) if l_from_list else nl
                ci = np.array(i, dtype = np.int64) if i_from_list else ni
                res = cl[ci]
                res = res.tolist() if l_to_list else res
                return res
            ct = timeit(lambda: Do(), number = 10)
            print(
                'numpy:', 'l_from_list', l_from_list, 'i_from_list', i_from_list, 'l_to_list', l_to_list,
                'time', round(ct, 3), 'speedup', round(pyt / ct, 2), flush = True
            )

outputs:

python: 2.279
numpy: l_from_list True  i_from_list True  l_to_list True  time 2.924 speedup 0.78
numpy: l_from_list True  i_from_list True  l_to_list False time 2.805 speedup 0.81
numpy: l_from_list True  i_from_list False l_to_list True  time 1.457 speedup 1.56
numpy: l_from_list True  i_from_list False l_to_list False time 1.312 speedup 1.74
numpy: l_from_list False i_from_list True  l_to_list True  time 2.352 speedup 0.97
numpy: l_from_list False i_from_list True  l_to_list False time 2.209 speedup 1.03
numpy: l_from_list False i_from_list False l_to_list True  time 0.894 speedup 2.55
numpy: l_from_list False i_from_list False l_to_list False time 0.75  speedup 3.04

So we can see that if both the data and the indexes are already stored as NumPy arrays, we gain about a 3x speedup. If only the indexes are a NumPy array, the speedup is about 1.56x, which is still worthwhile. If everything has to be converted from lists and back again, the "speedup" is 0.78x, meaning a slowdown; hence, if we work with lists only, then indexing through NumPy is not helpful.
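
The fastest case in the measurements, where nothing is converted, comes down to keeping the data in an object array and applying integer-array indexing. Here is a minimal sketch with small illustrative data; the helper name take_np and its as_list parameter are hypothetical and are not part of NumPy. Note that np.array on a list of equal-length sequences builds a multi-dimensional array, so this sketch assumes flat elements such as strings.

```python
import numpy as np

def take_np(arr, indexes, as_list=False):
    """Pick arr[indexes] with NumPy integer-array indexing (hypothetical helper)."""
    # No copy is made if indexes is already an int64 array
    idx = np.asarray(indexes, dtype=np.int64)
    res = arr[idx]
    # Converting back to a list costs extra time, so it is optional
    return res.tolist() if as_list else res

# Keep the data as an object array from the start to avoid repeated conversion
data = np.array(['a', 'b', 'c', 'd', 'e'], dtype=np.object_)
idx = np.array([4, 0, 2], dtype=np.int64)

picked = take_np(data, idx)
print(picked.tolist())                       # ['e', 'a', 'c']
print(take_np(data, [1, 1], as_list=True))   # ['b', 'b']
```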
