
Fastest possible way to iterate through a specific list?

Let's say I have a list:

list=['plu;ean;price;quantity','plu1;ean1;price1;quantity1']

I want to iterate over the list, split each item on ";", and apply an if clause, like this:

for item in list:
    split_item=item.split(";")
    if split_item[0] == "string_value" or split_item[1] == "string_value":
        do something.....

I was wondering if this is the fastest way possible. Let's say my initial list is a lot bigger (has a lot more items). I tried with a list comprehension:

item=[item.split(";") for item in list if item.split(";")[0] == "string_value" or item.split(";")[1] == "string_value"]

But this is actually giving me slower results: the first approach averages 90 ms, while the second averages 130 ms. Am I doing the list comprehension wrong? Is there a faster solution?

I was wondering if this is the fastest way possible?

No, of course not. You can implement it a lot faster in hand-coded assembly than in Python. So what?

If the "do something..." is not trivial, and there are many matches, the cost to do something 100000 times is going to be a lot more expensive than the cost of looping 500000 times, so finding the fastest way to loop doesn't matter at all.

In fact, just calling split two to three times each loop instead of remembering and reusing the result is going to swamp the cost of iteration, and so may not passing a maxsplit argument when you only care about the first two results.
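That point can be sketched as follows (a minimal illustration on the question's sample data; the match values 'plu' and 'ean' are assumptions for the example):

```python
items = ['plu;ean;price;quantity', 'plu1;ean1;price1;quantity1']

matches = []
for item in items:
    # Split once per item, and only as far as needed: maxsplit=2 yields
    # ['plu', 'ean', 'price;quantity'] instead of four separate pieces.
    first, second, rest = item.split(';', 2)
    if first == 'plu' or second == 'ean':
        matches.append((first, second, rest))
```

The split result is computed once and reused in the condition, rather than re-splitting the same string for each index access.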


So, you're trying to optimize the wrong thing. But what if, after you fix everything else, it turns out that the cost of iteration really does matter here?

Well, you can't use a comprehension directly to speed things up, because comprehensions are for expressions that return values, not statements that do things.

But, if you look at your code, you'll realize you're actually doing three things: splitting each string, then filtering out the ones that don't match, then doing the "do something". So, you can use a comprehension for the first two parts, and then you're only using a slow for loop for the much smaller list of values that passed the filter.

It looks like you tried this, but you made two mistakes.

First, you're better off with a generator expression than a list comprehension: you don't need a list here, just something to iterate over, so don't pay to build one.

Second, you don't want to split the string three times. You can probably find some convoluted way to get the split done once inside a single comprehension, but why bother? Just write each step as its own step.

So:

split_items = (item.split(';') for item in items)
filtered_items = (item for item in split_items 
                  if item[0] == "string_value" or item[1] == "string_value")
for item in filtered_items:
    do something...

Will this actually be faster? If you can get some real test data, and "do something..." code, that shows the iteration is a bottleneck, you can test on that real data and code. Until then, there's nothing to test.

Split the whole string only when the first two items retrieved from str.split(';', 2) satisfy the conditions:

>>> strs = 'plu;ean;price;quantity'
>>> strs.split(';', 2)
['plu', 'ean', 'price;quantity']

Here the third item ('price;quantity') is split only if the first two items satisfy the condition:

>>> lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*1000]*1000

A normal for-loop, with a single split of the whole string for each item of the list:

>>> %%timeit
for item in lis:
    split_item=item.split(";")
    if split_item[0] == "plu" or split_item[1] == "ean":pass
... 
1 loops, best of 3: 952 ms per loop

A list comprehension equivalent to the for-loop above:

>>> %timeit [x for x in (item.split(';') for item in lis) if x[0]== "plu" or x[1]=="ean"]
1 loops, best of 3: 961 ms per loop

Split on-demand:

>>> %timeit [[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== "plu" or y=="ean"]
1 loops, best of 3: 508 ms per loop

Of course, if the list and strings are small then such optimisation doesn't matter.
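As a quick sanity check (toy data from the question), the on-demand version produces exactly the same full splits as the straightforward comprehension for matching items:

```python
lis = ['plu;ean;price;quantity', 'plu1;ean1;price1;quantity1']

# Straightforward version: split everything, then filter.
full = [x for x in (item.split(';') for item in lis)
        if x[0] == 'plu' or x[1] == 'ean']

# On-demand version: split only the first two fields up front,
# and split the remainder only for items that pass the filter.
on_demand = [[x] + [y] + z.split(';')
             for x, y, z in (item.split(';', 2) for item in lis)
             if x == 'plu' or y == 'ean']

assert full == on_demand  # same result, fewer splits on non-matches
```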

I found a good alternative here.

You can use a combination of map and filter. Try this:

>>> import itertools
>>> split_list = itertools.imap(lambda x: x.split(";"), your_list)
>>> result = filter(lambda x: x[0] == "plu" or x[1] == "string_value", split_list)

The first line creates an iterator of split items, and the second one filters it. I ran a small benchmark in my IPython Notebook shell and got the following results:

1st test:

[benchmark screenshot]

With small sizes, the one-line solution works better.

2nd test:

[benchmark screenshot]

With a bigger list, the map/filter solution is slightly better.

3rd test:

[benchmark screenshot]

With a big list and bigger elements, the map/filter solution is way better.

I guess the difference in performance keeps increasing as the list grows, until it peaks at around 66% more time (in a 10000-element trial).

The difference between the map/filter solution and the list comprehension solutions is the number of calls to .split(): one calls it three times for each item, the other just once, because list comprehensions are just a pythonic way to do map/filter together. I used to use list comprehensions a lot and never really understood what lambda was all about, until I discovered that map and list comprehensions are essentially the same thing.
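To illustrate that equivalence (toy data; on Python 3, where map returns a lazy iterator like imap did on Python 2):

```python
items = ['plu;ean;price;quantity', 'plu1;ean1;price1;quantity1']

# A list comprehension with a filter clause...
via_comprehension = [x.split(';') for x in items if x.startswith('plu;')]

# ...does the same work as composing map and filter.
via_map_filter = list(map(lambda x: x.split(';'),
                          filter(lambda x: x.startswith('plu;'), items)))

assert via_comprehension == via_map_filter
```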

If you don't care about memory usage, you can use regular map instead of imap. It will create the whole list of splits at once, using more memory to store it, but it's slightly faster.

Actually, if you don't care about memory usage, you can write the map/filter solution using 2 list comprehensions and get the exact same result.

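A minimal sketch of what that two-comprehension version would look like (variable names are my own; toy data from the question):

```python
your_list = ['plu;ean;price;quantity', 'plu1;ean1;price1;quantity1']

# The first comprehension materialises all the splits at once
# (more memory than a generator); the second filters them.
split_list = [x.split(';') for x in your_list]
result = [x for x in split_list if x[0] == 'plu' or x[1] == 'string_value']
```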

EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.

If you're looking for speed, hcwhsa's answer should be good enough. If you need slightly more, look to re:

import re
from itertools import chain

lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*100]*1000

# Matches strings whose first field is "plu" or whose second field is "ean".
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
[l.split(';') for l in lis if matcher(l)]

Timings, for mostly positive results (i.e. split is the major cause of slowness):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

We see mine's a little faster.

10 loops, best of 3: 55 msec per loop
10 loops, best of 3: 49.5 msec per loop

For mostly negative results (most items are filtered out):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(1000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

The lead's a touch higher.

10 loops, best of 3: 40.9 msec per loop
10 loops, best of 3: 35.7 msec per loop

If the result will always be unique, use

next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')

or the faster Regex version

next(filter(matcher, lis)).split(';')

(use itertools.ifilter on Python 2).

Timings:

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity'] + ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')"

python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
python -m timeit -s "$SETUP" "next(filter(matcher, lis)).split(';')"

Results:

10 loops, best of 3: 31.3 msec per loop
100 loops, best of 3: 15.2 msec per loop
10 loops, best of 3: 28.8 msec per loop
100 loops, best of 3: 14.1 msec per loop

So this gives a substantial boost to both methods.
