Fastest possible way to iterate through a specific list?

Question

Let's say I have a list:

list=['plu;ean;price;quantity','plu1;ean1;price1;quantity1']

I want to iterate over the list + split the list by ";" and put an if clause, like this:

for item in list:
    split_item=item.split(";")
    if split_item[0] == "string_value" or split_item[1] == "string_value":
        do something.....

I was wondering, if this is the fastest way possible? Let's say my initial list is a lot bigger (has a lot more list items). I tried with list comprehensions:

item=[item.split(";") for item in list if item.split(";")[0] == "string_value" or item.split(";")[1] == "string_value"]

But this is actually giving me slower results. The first case is giving me an average of 90ms, while the second one is giving me an average of 130ms. Am I doing the list comprehension wrong? Is there a faster solution?

Answer 1

I was wondering, if this is the fastest way possible?

No, of course not. You can implement it a lot faster in hand-coded assembly than in Python. So what?

If the "do something..." is not trivial, and there are many matches, the cost to do something 100000 times is going to be a lot more expensive than the cost of looping 500000 times, so finding the fastest way to loop doesn't matter at all.

In fact, just calling split two to three each loop instead of remembering and reusing the result is going to swamp the cost of iteration, and not passing a maxsplit argument when you only care about two results may as well.

So, you're trying to optimize the wrong thing. But what if, after you fix everything else, it turns out that the cost of iteration really does matter here?

Well, you can't use a comprehension directly to speed things up, because comprehensions are for expressions that return values, not statements to do things.

But, if you look at your code, you'll realize you're actually doing three things: splitting each string, then filtering out the ones that don't match, then doing the "do something". So, you can use a comprehension for the first two parts, and then you're only using a slow for loop for the much smaller list of values that passed the filter.

It looks like you tried this, but you made two mistakes.

First, you're better off with a generator expression than a list comprehension—you don't need a list here, just something to iterator over, so don't pay to build one.

Second, you don't want to split the string three times. You can probably find some convoluted way to get the split done once in a single comprehension, but why bother? Just write each step as its own step.

So:

split_items = (item.split(';') for item in items)
filtered_items = (item for item in split_items 
                  if item[0] == "string_value" or item[1] == "string_value")
for item in filtered_items:
    do something...

Will this actually be faster? If you can get some real test data, and "do something..." code, that shows that the iteration is a bottleneck, you can test on that real data and code. Until then, there's nothing to test.

Answer 2

Split the whole string only when the first two items retrieved from str.split(';', 2) satisfy the conditions:

>>> strs = 'plu;ean;price;quantity'
>>> strs.split(';', 2)
['plu', 'ean', 'price;quantity']

Here split the third item( 'price;quantity' ) only if the first two items have satisfied the condition:

>>> lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*1000]*1000

Normal for-loop, single split of whole string for each item of the list.

>>> %%timeit
for item in lis:
    split_item=item.split(";")
    if split_item[0] == "plu" or split_item[1] == "ean":pass
... 
1 loops, best of 3: 952 ms per loop

List comprehension equivalent to the for-loop above:

>>> %timeit [x for x in (item.split(';') for item in lis) if x[0]== "plu" or x[1]=="ean"]
1 loops, best of 3: 961 ms per loop

Split on-demand:

>>> %timeit [[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== "plu" or y=="ean"]
1 loops, best of 3: 508 ms per loop

Of course, if the list and strings are small then such optimisation doesn't matter.

Answer 3

I found a good alternative here.

You can use a combination of map and filter. Try this:

>>>import itertools
>>>splited_list = itertools.imap(lambda x: x.split(";"), your_list)
>>>result = filter(lambda x: filter(lambda x: x[0] == "plu" or x[1] == "string_value", lista)

The first item will create a iterator of elements. And The second one will filter it. I run a small benchmark in my IPython Notebook shell, and got the following results:

1st test:

在此输入图像描述

With small sizes, the one-line solution works better

2nd test:

在此输入图像描述

With a bigger list, the map/filter solution is slightly better

3rd test:

在此输入图像描述

With a big list and bigger elements, the map/filter solution it`s way better.

I guess the difference in performance continues increasing as the size of the list goes by, untill peaks in 66% more time (in a 10000 elements list trial).

The difference between the map/filter solution and the list comprehension solutions is the number of calls to .split(). Ones calls it 3 times for each item, the other just one, because list comprehensions are just a pythonic way to do map/filter together. I used to use list comprehensions a lot, and thought that i don't knew what the lambda was all about. Untill i discovered that map and list comprehensions are the same thing.

If you don't care about memory usage, you can use regular map instead of imap. It will create the list with splits at once. It will use more memory to store it, but its slightly faster.

Actually, if you don't care about memory usage, you can write the map/filter solution using 2 list comprehensions, and get the same exact result. Checkout:

在此输入图像描述

Answer 4

EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.

If you're looking for speed, hcwhsa's answer should be good enough. If you need slightly more, look to re .

import re
from itertools import chain

lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*100]*1000

matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
[l.split(';') for l in lis if matcher(l)]

Timings, for mostly positive results (aka. split is the major cause of slowness):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

We see mine's a little faster.

10 loops, best of 3: 55 msec per loop
10 loops, best of 3: 49.5 msec per loop

For mostly negative results (most things are filtered):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(1000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

The lead's a touch higher.

10 loops, best of 3: 40.9 msec per loop
10 loops, best of 3: 35.7 msec per loop

If the result will always be unique, use

next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')

or the faster Regex version

next(filter(matcher, lis)).split(';')

(use itertools.ifilter on Python 2).

Timings:

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity'] + ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')"

python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
python -m timeit -s "$SETUP" "next(filter(matcher, lis)).split(';')"

Results:

10 loops, best of 3: 31.3 msec per loop
100 loops, best of 3: 15.2 msec per loop
10 loops, best of 3: 28.8 msec per loop
100 loops, best of 3: 14.1 msec per loop

So this gives a substantial boost to both methods.

Fastest possible way to iterate through a specific list?

Question

4 answers

solution1
3 2013-09-25 18:57:31

solution2
1 2013-09-25 18:53:33

solution3
1 2013-09-25 19:49:34

solution4
1 ACCPTED 2013-09-25 20:24:51

EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.

Fastest possible way to iterate through a specific list?

Question

4 answers

solution1 3 2013-09-25 18:57:31

solution2 1 2013-09-25 18:53:33

solution3 1 2013-09-25 19:49:34

solution4 1 ACCPTED 2013-09-25 20:24:51

EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.

solution1
3 2013-09-25 18:57:31

solution2
1 2013-09-25 18:53:33

solution3
1 2013-09-25 19:49:34

solution4
1 ACCPTED 2013-09-25 20:24:51