Fastest way to extract a sub-list from a Python list given an array of indexes

Question

I have a large Python list l of objects of any type, and another large list i (or even a NumPy array) of integer indexes pointing to some elements of l.

The question is: what is the fastest (most efficient) way to create another list l2 that contains the elements of l at the indexes given in i?

The easiest way is a list comprehension:

l2 = [l[si] for si in i]
# Use np.nditer(i) instead of i, for NumPy array case
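
For the NumPy case, one possible variant (a sketch, not benchmarked here) converts the index array to a list of plain Python ints with ndarray.tolist() before looping, so the comprehension indexes with native ints rather than NumPy values. The name ni and the sample values are illustrative only.

```python
import numpy as np

l = ["a", "b", "c", "d", "e"]                 # source list (any objects)
ni = np.array([4, 0, 2, 2], dtype=np.int64)   # indexes stored as a NumPy array

# tolist() turns the array into a list of plain Python ints
l2 = [l[si] for si in ni.tolist()]
print(l2)  # ['e', 'a', 'c', 'c']
```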

But is this the fastest possible way?

A list comprehension is a Python-level loop, so it might be slow for large lists. Is there a built-in method in the standard library, written in efficient C, that does just this? Or does NumPy have a method for indexing a Python list by a NumPy array?

Maybe the standard library has some simple and fast function that does for lists what NumPy's np.take does for arrays, as in the imaginary code below:

import listtools  # imaginary module, for illustration only
l2 = listtools.take(l, i)

Answer 1

You can get a minor speedup (~25% in the example below) by using operator.itemgetter, which supports bulk lookup:

>>> import string
>>> import random
>>> import operator as op
>>> from timeit import timeit

# create random lists
>>> l = [random.choice([*string.ascii_letters,*range(100)]) for _ in range(1000000)]
>>> i = [random.randint(0,999999) for _ in range(300000)]

# timings
>>> timeit(lambda:[l[si] for si in i],number=100)
3.0997245000035036
>>> timeit(lambda:list(map(l.__getitem__,i)),number=100)
2.892384369013598
>>> timeit(lambda:list(op.itemgetter(*i)(l)),number=100)
2.1787672539940104
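
One caveat if you wrap the itemgetter approach in a helper: operator.itemgetter needs at least one argument, and with exactly one argument it returns the bare item rather than a tuple. A minimal sketch that handles both cases (the function name take is hypothetical, not a standard-library function):

```python
import operator as op

def take(l, i):
    """Return [l[k] for k in i], using operator.itemgetter for bulk lookup."""
    if len(i) == 0:
        return []                      # itemgetter() with no arguments raises TypeError
    if len(i) == 1:
        return [l[i[0]]]               # itemgetter(k)(l) returns a single item, not a tuple
    return list(op.itemgetter(*i)(l))  # tuple of items -> list

l = ["a", "b", "c", "d"]
print(take(l, [3, 0, 0]))  # ['d', 'a', 'a']
print(take(l, [2]))        # ['c']
print(take(l, []))         # []
```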

Answer 2

NumPy arrays can also store and process arbitrary Python objects through dtype = np.object_.

So I decided to measure the speed of NumPy compared to plain Python. As I mentioned in my question, I also want to cover the case where the indexes are a NumPy array of integers.

The following code measures the different cases: whether or not the source lists need to be converted to NumPy arrays, and whether the result should be converted back to a list.


import string
from timeit import timeit
import numpy as np
np.random.seed(0)

letters = np.array(list(string.ascii_letters), dtype = np.object_)
nl = letters[np.random.randint(0, len(letters), size = (10 ** 6,))]
l = nl.tolist()
ni = np.random.permutation(np.arange(nl.size, dtype = np.int64))
i = ni.tolist()

pyt = timeit(lambda: [l[si] for si in i], number = 10)
print('python:', round(pyt, 3), flush = True)

for l_from_list in [True, False]:
    for i_from_list in [True, False]:
        for l_to_list in [True, False]:
            def Do():
                cl = np.array(l, dtype = np.object_) if l_from_list else nl
                ci = np.array(i, dtype = np.int64) if i_from_list else ni
                res = cl[ci]
                res = res.tolist() if l_to_list else res
                return res
            ct = timeit(lambda: Do(), number = 10)
            print(
                'numpy:', 'l_from_list', l_from_list, 'i_from_list', i_from_list, 'l_to_list', l_to_list,
                'time', round(ct, 3), 'speedup', round(pyt / ct, 2), flush = True
            )

Output:

python: 2.279
numpy: l_from_list True  i_from_list True  l_to_list True  time 2.924 speedup 0.78
numpy: l_from_list True  i_from_list True  l_to_list False time 2.805 speedup 0.81
numpy: l_from_list True  i_from_list False l_to_list True  time 1.457 speedup 1.56
numpy: l_from_list True  i_from_list False l_to_list False time 1.312 speedup 1.74
numpy: l_from_list False i_from_list True  l_to_list True  time 2.352 speedup 0.97
numpy: l_from_list False i_from_list True  l_to_list False time 2.209 speedup 1.03
numpy: l_from_list False i_from_list False l_to_list True  time 0.894 speedup 2.55
numpy: l_from_list False i_from_list False l_to_list False time 0.75  speedup 3.04

So if everything is already stored as NumPy arrays, we gain a 3x speedup. If only the indexes are a NumPy array (the data and the result remain lists), the speedup is 1.56x, which is still very good. If everything has to be converted from lists and back, the "speedup" is 0.78x, meaning we actually slow down; hence, if you work with lists only, indexing through NumPy is not helpful.
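
The best case above (both the data and the indexes already held as NumPy arrays) can be sketched as follows. This is a small illustration with made-up sample values; it only checks that fancy indexing returns the same elements as the list comprehension and does not reproduce the timings.

```python
import numpy as np

l = ["x", "y", "z", "w"]
i = [3, 1, 1, 0]

nl = np.array(l, dtype=np.object_)   # data kept as an object array
ni = np.array(i, dtype=np.int64)     # indexes kept as an integer array

res = nl[ni]                         # fancy indexing; the result is still a NumPy array
assert res.tolist() == [l[si] for si in i]
print(res.tolist())  # ['w', 'y', 'y', 'x']
```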


Answer 1
4 Accepted 2020-10-03 08:37:26

Answer 2
3 2020-10-03 10:15:16
