
Fastest method for extracting sub-list from Python list given array of indexes

I have a large Python list l of objects of arbitrary type, and another large list i (or even a NumPy array) of integer indexes pointing to some elements of l.

The question is: what is the fastest (most efficient) way to create another list l2 containing the elements of l at the indexes given in i?

The easiest way is a list comprehension:

l2 = [l[si] for si in i]
# Use np.nditer(i) instead of i, for NumPy array case
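For the NumPy-array case, here is a minimal sketch (the sample data is made up for illustration) of two ways to feed the indexes to the same comprehension. It only shows that both give the same result; which one is faster is not measured here.

```python
import numpy as np

l = ["a", "b", "c", "d"]
i = np.array([3, 0, 2], dtype=np.int64)

# Option 1: iterate the array directly; each si is a NumPy integer scalar,
# which a Python list accepts as an index.
l2_direct = [l[si] for si in i]

# Option 2: convert the indexes to plain Python ints once, then index.
l2_tolist = [l[si] for si in i.tolist()]
```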

But is this the fastest possible way?

A list comprehension is a Python-level loop, so it might be slow for large lists. Is there some built-in method in the standard library, written in efficient C, that achieves just this task? Or does NumPy offer a way to index a Python list by a NumPy array?

Maybe the standard Python library has some simple and fast function that does the same as NumPy's np.take, like in the imaginary code below:

import listtools
l2 = listtools.take(l, indexes)

You can get a minor speedup (about 25-30% less time in the example below, depending on which baseline you compare against) by using operator.itemgetter, which supports bulk lookup:

>>> import string
>>> import random
>>> import operator as op
>>> from timeit import timeit

# create random lists
>>> l = [random.choice([*string.ascii_letters,*range(100)]) for _ in range(1000000)]
>>> i = [random.randint(0,999999) for _ in range(300000)]

# timings
>>> timeit(lambda:[l[si] for si in i],number=100)
3.0997245000035036
>>> timeit(lambda:list(map(l.__getitem__,i)),number=100)
2.892384369013598
>>> timeit(lambda:list(op.itemgetter(*i)(l)),number=100)
2.1787672539940104
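One caveat with the itemgetter approach: called with a single index it returns the bare item rather than a tuple, and called with no indexes it raises TypeError. A minimal sketch of a wrapper that handles both cases follows; the name take is a hypothetical helper, not a standard-library function.

```python
import operator as op

def take(seq, indexes):
    """Return [seq[k] for k in indexes] via operator.itemgetter.

    `take` is a hypothetical helper name used for illustration only.
    """
    indexes = list(indexes)
    if not indexes:
        # itemgetter() with no arguments raises TypeError
        return []
    if len(indexes) == 1:
        # itemgetter(k) returns the item itself, not a 1-tuple,
        # so wrapping it in list() would be wrong (e.g. it splits a string)
        return [seq[indexes[0]]]
    return list(op.itemgetter(*indexes)(seq))
```

For example, take(["ab", "cd"], [1]) gives ["cd"], whereas list(op.itemgetter(1)(["ab", "cd"])) would give ["c", "d"].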

NumPy arrays can also store and process arbitrary Python objects through dtype = np.object_.
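As a small illustration of object arrays (the sample values are made up), the sketch below stores mixed Python objects in an array and picks some of them with an integer index array. It assumes none of the elements is itself a list or tuple, since np.array may then try to build a multi-dimensional array.

```python
import numpy as np

# Mixed Python objects; with dtype=np.object_ the array holds references to them.
items = ["a", 1, None, {"k": 2}]
arr = np.array(items, dtype=np.object_)

idx = np.array([3, 0], dtype=np.int64)
picked = arr[idx]          # integer-array indexing returns a new object array
as_list = picked.tolist()  # back to a plain Python list
```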

So I decided to measure how NumPy compares with plain Python. As mentioned in my question, I also want to cover the case when the indexes are a NumPy array of integers.

The following code measures the different cases: whether or not the source lists have to be converted to NumPy arrays, and whether the result has to be converted back to a list.


import string
from timeit import timeit
import numpy as np
np.random.seed(0)

letters = np.array(list(string.ascii_letters), dtype = np.object_)
nl = letters[np.random.randint(0, len(letters), size = (10 ** 6,))]
l = nl.tolist()
ni = np.random.permutation(np.arange(nl.size, dtype = np.int64))
i = ni.tolist()

pyt = timeit(lambda: [l[si] for si in i], number = 10)
print('python:', round(pyt, 3), flush = True)

for l_from_list in [True, False]:
    for i_from_list in [True, False]:
        for l_to_list in [True, False]:
            def Do():
                cl = np.array(l, dtype = np.object_) if l_from_list else nl
                ci = np.array(i, dtype = np.int64) if i_from_list else ni
                res = cl[ci]
                res = res.tolist() if l_to_list else res
                return res
            ct = timeit(lambda: Do(), number = 10)
            print(
                'numpy:', 'l_from_list', l_from_list, 'i_from_list', i_from_list, 'l_to_list', l_to_list,
                'time', round(ct, 3), 'speedup', round(pyt / ct, 2), flush = True
            )

Output:

python: 2.279
numpy: l_from_list True  i_from_list True  l_to_list True  time 2.924 speedup 0.78
numpy: l_from_list True  i_from_list True  l_to_list False time 2.805 speedup 0.81
numpy: l_from_list True  i_from_list False l_to_list True  time 1.457 speedup 1.56
numpy: l_from_list True  i_from_list False l_to_list False time 1.312 speedup 1.74
numpy: l_from_list False i_from_list True  l_to_list True  time 2.352 speedup 0.97
numpy: l_from_list False i_from_list True  l_to_list False time 2.209 speedup 1.03
numpy: l_from_list False i_from_list False l_to_list True  time 0.894 speedup 2.55
numpy: l_from_list False i_from_list False l_to_list False time 0.75  speedup 3.04

So we can see that if we keep everything as NumPy arrays, we gain a 3x speedup! If only the indexes are a NumPy array (the list is converted to an array and the result back to a list), the speedup is just 1.56x, which is still very good. When everything has to be converted from lists and back, the "speedup" is 0.78x, meaning we actually slow down. Hence, if we work with lists only, indexing through NumPy is not helpful.
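One possible way to act on these timings is sketched below: use NumPy indexing only when the indexes already arrive as a NumPy array, and fall back to the plain comprehension otherwise. The name take_auto is hypothetical, and the sketch assumes the elements of l are not themselves lists or tuples (strings are fine), because np.array could otherwise build a multi-dimensional array.

```python
import numpy as np

def take_auto(l, i):
    """Pick the elements of l at positions i and return a list.

    `take_auto` is a hypothetical helper, not part of NumPy or the stdlib.
    """
    if isinstance(i, np.ndarray):
        # Indexes are already an array: converting l (if needed) and
        # indexing through NumPy was faster in the measurements above.
        arr = l if isinstance(l, np.ndarray) else np.array(l, dtype=np.object_)
        return arr[i].tolist()
    # Indexes are a plain list: converting them was not worth it,
    # so stay with the list comprehension.
    return [l[si] for si in i]
```

The thresholds implied here come from a single benchmark on one machine, so it is worth re-timing with your own data before relying on this dispatch.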
