Fastest method for extracting sub-list from Python list given array of indexes
I have a large Python list `l` of objects of any type, and another large list `i` (or even a NumPy array) of integer indexes pointing to some elements within `l`.
The question is: what is the fastest (most efficient) way of creating another list `l2` that contains the elements of `l` at the indexes given in `i`?
The easiest way is to use a list comprehension:
l2 = [l[si] for si in i]
# Use np.nditer(i) instead of i, for NumPy array case
But is this the fastest possible way? A list comprehension is a Python-level loop, so it might be slow for large lists. Maybe the standard library has a built-in method, written in efficient C, that achieves just this task? Or maybe NumPy has a method for indexing a Python list by a NumPy array?
Maybe the standard Python library has a simple and fast function that does the same as NumPy's `np.take`, as in the imaginary code below:
import listtools
l2 = listtools.take(l, indexes)
You can get a minor speedup (~25% in the example below) by using `operator.itemgetter`, which supports bulk lookup:
>>> import string
>>> import random
>>> import operator as op
>>> from timeit import timeit
# create random lists
>>> l = [random.choice([*string.ascii_letters,*range(100)]) for _ in range(1000000)]
>>> i = [random.randint(0,999999) for _ in range(300000)]
# timings
>>> timeit(lambda:[l[si] for si in i],number=100)
3.0997245000035036
>>> timeit(lambda:list(map(l.__getitem__,i)),number=100)
2.892384369013598
>>> timeit(lambda:list(op.itemgetter(*i)(l)),number=100)
2.1787672539940104
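One caveat with `operator.itemgetter`: called with a single index it returns the bare item rather than a tuple, and called with no indexes at all it raises `TypeError`. The sketch below wraps it so that both edge cases still yield a list. The name `take` is hypothetical (it mirrors the imaginary `listtools.take` from the question) and is not a standard library function.

```python
import operator as op

def take(l, indexes):
    """Return the elements of l at the given indexes, as a list.

    `take` is a hypothetical helper name, not part of the standard library.
    """
    indexes = list(indexes)
    if not indexes:
        # op.itemgetter() with no arguments raises TypeError
        return []
    if len(indexes) == 1:
        # op.itemgetter(k)(l) returns the bare item, not a 1-tuple
        return [l[indexes[0]]]
    # Bulk lookup: itemgetter returns a tuple, so convert it to a list
    return list(op.itemgetter(*indexes)(l))

sample = ['a', 'b', 'c', 'd', 'e']
print(take(sample, [4, 0, 2]))  # several indexes
print(take(sample, [1]))        # single index
print(take(sample, []))         # no indexes
```

The extra `list(indexes)` copy and the length checks add some overhead, so this wrapper may give back part of the speedup measured above.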
It is known that NumPy arrays can also store and process arbitrary Python objects through `dtype = np.object_`. So I decided to measure the speed of NumPy compared to plain Python. Also, as I mentioned in my question, I want to solve the case when the indexes are a NumPy array of integers.
The following code measures the different cases: whether or not the source lists need to be converted to NumPy arrays, and whether the result should be converted back to a list.
import string
from timeit import timeit
import numpy as np
np.random.seed(0)

letters = np.array(list(string.ascii_letters), dtype = np.object_)
nl = letters[np.random.randint(0, len(letters), size = (10 ** 6,))]
l = nl.tolist()
ni = np.random.permutation(np.arange(nl.size, dtype = np.int64))
i = ni.tolist()

pyt = timeit(lambda: [l[si] for si in i], number = 10)
print('python:', round(pyt, 3), flush = True)

for l_from_list in [True, False]:
    for i_from_list in [True, False]:
        for l_to_list in [True, False]:
            def Do():
                cl = np.array(l, dtype = np.object_) if l_from_list else nl
                ci = np.array(i, dtype = np.int64) if i_from_list else ni
                res = cl[ci]
                res = res.tolist() if l_to_list else res
                return res
            ct = timeit(lambda: Do(), number = 10)
            print(
                'numpy:', 'l_from_list', l_from_list, 'i_from_list', i_from_list, 'l_to_list', l_to_list,
                'time', round(ct, 3), 'speedup', round(pyt / ct, 2), flush = True
            )
Outputs:
python: 2.279
numpy: l_from_list True i_from_list True l_to_list True time 2.924 speedup 0.78
numpy: l_from_list True i_from_list True l_to_list False time 2.805 speedup 0.81
numpy: l_from_list True i_from_list False l_to_list True time 1.457 speedup 1.56
numpy: l_from_list True i_from_list False l_to_list False time 1.312 speedup 1.74
numpy: l_from_list False i_from_list True l_to_list True time 2.352 speedup 0.97
numpy: l_from_list False i_from_list True l_to_list False time 2.209 speedup 1.03
numpy: l_from_list False i_from_list False l_to_list True time 0.894 speedup 2.55
numpy: l_from_list False i_from_list False l_to_list False time 0.75 speedup 3.04
So we can see that if we store all lists as NumPy arrays, we gain a 3x speedup! If only the indexes are a NumPy array, the speedup is just 1.56x, which is still very good. In the case when everything has to be converted from lists and back again, the speedup is 0.78x, meaning we slow down; hence, if we work with lists only, then indexing through NumPy is not helpful.
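Stripped of the timing harness, the NumPy approach measured above comes down to one fancy-indexing expression. Here is a minimal sketch on made-up sample data, assuming the list elements are flat values rather than nested sequences (with nested sequences, `np.array` may build a multi-dimensional array instead of a 1-D object array).

```python
import numpy as np

# Made-up sample data for illustration only
l = ['a', 1, None, 'd', 2.5]
i = [4, 0, 3]

# Convert once and keep both as NumPy arrays if they are reused many times
nl = np.array(l, dtype=np.object_)  # 1-D array holding references to the Python objects
ni = np.array(i, dtype=np.int64)    # integer index array

res = nl[ni]       # fancy indexing: the lookup loop runs inside NumPy
l2 = res.tolist()  # optional: convert back to a plain Python list
print(l2)
```

As the timings show, the conversions (`np.array(...)` and `.tolist()`) are where most of the cost goes, so this pays off mainly when `nl` and `ni` already exist as arrays.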