
Python: why is multiple list comprehensions seemingly faster than a single for loop with if...elif statements?

I have a bit of code that I am trying to determine if there is a faster way to run it. Essentially, I have a delimited file that I am iterating over to find a set of flags to parse the data. These files can be very long, so I am trying to find a fast method for this.

The two methods I have tried are list comprehension, and a for loop:

Method 1:

flag_set_1 = [i for i,row in enumerate(data_file) if row[0] == flag_1]
flag_set_2 = [i for i,row in enumerate(data_file) if row[0] == flag_2]
flag_set_3 = [i for i,row in enumerate(data_file) if row[0] == flag_3]
flag_set_4 = [i for i,row in enumerate(data_file) if row[0] == flag_4]

Method 2:

for i, row in enumerate(data_file):
    if row[0] == flag_1:
        flag_set_1.append(i)
    elif row[0] == flag_2:
        flag_set_2.append(i)
    elif row[0] == flag_3:
        flag_set_3.append(i)
    elif row[0] == flag_4:
        flag_set_4.append(i)

I was actually expecting the list comprehensions to be slower in this case, since method 1 has to iterate over data_file four times while method 2 only iterates once. I suspect that the use of append() in method 2 is what is slowing it down.

So I ask, is there a quicker way to implement this?

Without a data sample or a benchmark, it's hard to reproduce your observation. I tried with:

from random import randint
data_file = [[randint(0, 15) for _ in range(20)] for _ in range(100000)]
flag_1 = 1
flag_2 = 2
flag_3 = 3
flag_4 = 4

And the regular loop was twice as fast as the four list comprehensions (see benchmark below).

If you want to improve the speed of the process, you have several leads.

List comprehensions and regular loop

If the flag_n values are strings and you are sure that row[0] is one of them for every row, then you can check a single character instead of the whole string. E.g.:

flag_1 = "first flag"
flag_2 = "second flag"
flag_3 = "third flag"
flag_4 = "fourth flag"

Look at the second characters: the 'i' in first, the 'e' in second, the 'h' in third, the 'o' in fourth. You just have to check row[0][1] == 'i' (or 'e', 'h', 'o') instead of row[0] == flag_n.
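A minimal sketch of that single-character check, using the example flag strings above (the sample rows are made up for illustration):

```python
# Sample rows; in the real case these come from the delimited file.
data_file = [
    ["first flag", "a"],
    ["second flag", "b"],
    ["first flag", "c"],
    ["third flag", "d"],
    ["fourth flag", "e"],
]

# Compare one character instead of the whole string: the second character
# uniquely identifies each flag ('i', 'e', 'h', 'o').
flag_set_1 = [i for i, row in enumerate(data_file) if row[0][1] == 'i']
flag_set_2 = [i for i, row in enumerate(data_file) if row[0][1] == 'e']

print(flag_set_1)  # [0, 2] — indices of "first flag" rows
print(flag_set_2)  # [1] — indices of "second flag" rows
```

This only works if every row is guaranteed to carry one of the four flags; with unexpected values in row[0], the single-character test can produce false matches.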

Regular loop

If you want to improve the speed of the regular loop, you have several leads.

In all cases

You can assign flag = row[0] once per iteration instead of indexing into row up to four times. That's basic, but it works.
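For instance, the loop from the question with the lookup hoisted into a local variable (the sample data and flag values are made up for illustration):

```python
# Made-up sample data and flags for illustration.
data_file = [[1, "a"], [2, "b"], [1, "c"], [3, "d"]]
flag_1, flag_2, flag_3, flag_4 = 1, 2, 3, 4

flag_set_1, flag_set_2, flag_set_3, flag_set_4 = [], [], [], []

for i, row in enumerate(data_file):
    flag = row[0]  # fetch row[0] once instead of up to four times
    if flag == flag_1:
        flag_set_1.append(i)
    elif flag == flag_2:
        flag_set_2.append(i)
    elif flag == flag_3:
        flag_set_3.append(i)
    elif flag == flag_4:
        flag_set_4.append(i)

print(flag_set_1)  # [0, 2]
```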

If you have information about the data

If the data is sorted by flag, you can obviously build each flag_set_n at once: find the first and last indices of flag_n and write flag_set_n = list(range(first_flag_n_index, last_flag_n_index + 1)).
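A sketch of that idea, assuming data_file is sorted by flag (the sample data and the forward/backward scans are illustrative):

```python
# Made-up data, sorted by flag.
data_file = [[1], [1], [1], [2], [2], [3], [4], [4]]
flag_2 = 2

# First occurrence, scanning forward; last occurrence, scanning backward.
first_index = next(i for i, row in enumerate(data_file) if row[0] == flag_2)
last_index = len(data_file) - 1 - next(
    i for i, row in enumerate(reversed(data_file)) if row[0] == flag_2
)

# All rows with flag_2 form one contiguous block of indices.
flag_set_2 = list(range(first_index, last_index + 1))
print(flag_set_2)  # [3, 4]
```

On sorted data, bisect from the standard library could find both boundaries in O(log n) instead of scanning.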

If you know the frequency of the flags, you can order the if... elif... elif... elif... else chain to check the most frequent flag first, then the second most frequent, and so on.

You can also use a dict to avoid the if... elif... sequence. If you don't have too many rows that don't match any flag, you can use a defaultdict :

from collections import defaultdict

def test_append_default_dict():
    flag_set = defaultdict(list)

    for i, row in enumerate(data_file):
        flag_set[row[0]].append(i)

    return tuple(flag_set[f] for f in (flag_1, flag_2, flag_3, flag_4))

Benchmarks with the data above (timings in seconds):

test_list_comprehensions    3.8617278739984613
test_append                 1.9978336450003553
test_append_default_dict    1.4595633919998363
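These timings can be reproduced with timeit; a minimal harness is sketched below (the repetition count and the smaller data size here are assumptions, so the absolute numbers will differ):

```python
from collections import defaultdict
from random import randint
from timeit import timeit

# Smaller data set than the benchmark above, so this sketch runs quickly.
data_file = [[randint(0, 15) for _ in range(20)] for _ in range(1000)]
flag_1, flag_2, flag_3, flag_4 = 1, 2, 3, 4

def test_append_default_dict():
    flag_set = defaultdict(list)
    for i, row in enumerate(data_file):
        flag_set[row[0]].append(i)
    return tuple(flag_set[f] for f in (flag_1, flag_2, flag_3, flag_4))

# timeit returns the total elapsed time in seconds for `number` calls.
elapsed = timeit(test_append_default_dict, number=100)
print(elapsed)
```

The same harness, pointed at the other two functions, produces the comparison above.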
