I have a bit of code and I am trying to determine if there is a faster way to run it. Essentially, I have a delimited file that I iterate over to find a set of flags used to parse the data. These files can be very long, so I am trying to find a fast method for this.
The two methods I have tried are list comprehensions and a for loop:
Method 1:
flag_set_1 = [i for i,row in enumerate(data_file) if row[0] == flag_1]
flag_set_2 = [i for i,row in enumerate(data_file) if row[0] == flag_2]
flag_set_3 = [i for i,row in enumerate(data_file) if row[0] == flag_3]
flag_set_4 = [i for i,row in enumerate(data_file) if row[0] == flag_4]
Method 2:
flag_set_1, flag_set_2, flag_set_3, flag_set_4 = [], [], [], []
for i, row in enumerate(data_file):
    if row[0] == flag_1:
        flag_set_1.append(i)
    elif row[0] == flag_2:
        flag_set_2.append(i)
    elif row[0] == flag_3:
        flag_set_3.append(i)
    elif row[0] == flag_4:
        flag_set_4.append(i)
I was actually expecting the list comprehension to be slower in this case, thinking that method 1 would have to iterate over data_file four times while method 2 would only have to iterate once. I suspect that the use of append() in method 2 is what is slowing it down.
So I ask, is there a quicker way to implement this?
Without any data sample or benchmark, it's hard to reproduce your observation. I tried with:
from random import randint
data_file = [[randint(0, 15) for _ in range(20)] for _ in range(100000)]
flag_1 = 1
flag_2 = 2
flag_3 = 3
flag_4 = 4
And the regular loop was twice as fast as the four list comprehensions (see benchmark below).
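The comparison can be timed with a harness along these lines (the exact benchmark code is not shown; timeit with number=10 is an assumption):
from timeit import timeit

def test_list_comprehensions():
    # Method 1 from the question: four passes over data_file.
    return tuple([i for i, row in enumerate(data_file) if row[0] == f]
                 for f in (flag_1, flag_2, flag_3, flag_4))

def test_append():
    # Method 2 from the question: a single pass with explicit branching.
    flag_set_1, flag_set_2, flag_set_3, flag_set_4 = [], [], [], []
    for i, row in enumerate(data_file):
        if row[0] == flag_1:
            flag_set_1.append(i)
        elif row[0] == flag_2:
            flag_set_2.append(i)
        elif row[0] == flag_3:
            flag_set_3.append(i)
        elif row[0] == flag_4:
            flag_set_4.append(i)
    return flag_set_1, flag_set_2, flag_set_3, flag_set_4

for func in (test_list_comprehensions, test_append):
    print(func.__name__, timeit(func, number=10))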
If you want to speed up the process, there are several avenues to explore.
If the flag_n values are strings and you are sure that row[0] is always one of them for every row, then you can check a single character instead of comparing the whole string. E.g.:
flag_1 = "first flag"
flag_2 = "second flag"
flag_3 = "third flag"
flag_4 = "fourth flag"
Look at the second characters: f[i]rst, s[e]cond, t[h]ird, f[o]urth. You just have to check row[0][1] == 'i' (or 'e', 'h', 'o') instead of row[0] == flag_n.
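A minimal sketch of that idea, assuming the four flag strings above:
flag_set_1, flag_set_2, flag_set_3, flag_set_4 = [], [], [], []
for i, row in enumerate(data_file):
    c = row[0][1]               # second character: 'i', 'e', 'h' or 'o'
    if c == 'i':
        flag_set_1.append(i)
    elif c == 'e':
        flag_set_2.append(i)
    elif c == 'h':
        flag_set_3.append(i)
    else:                       # must be 'o' if every row carries a flag
        flag_set_4.append(i)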
If you want to speed up the regular loop itself, there are a few more things to try.
You can assign flag = row[0] once per row instead of indexing the first element up to four times. That's basic, but it works.
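A minimal sketch of that change:
flag_set_1, flag_set_2, flag_set_3, flag_set_4 = [], [], [], []
for i, row in enumerate(data_file):
    flag = row[0]               # one lookup per row instead of up to four
    if flag == flag_1:
        flag_set_1.append(i)
    elif flag == flag_2:
        flag_set_2.append(i)
    elif flag == flag_3:
        flag_set_3.append(i)
    elif flag == flag_4:
        flag_set_4.append(i)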
If the data is sorted by flag, you can obviously build each flag_n_set at once: find the first and the last indices of flag_n and write flag_n_set = list(range(first_flag_n_index, last_flag_n_index + 1)).
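A sketch using the bisect module, assuming data_file is a list sorted by row[0]:
from bisect import bisect_left, bisect_right

keys = [row[0] for row in data_file]            # one extraction pass
first_flag_1_index = bisect_left(keys, flag_1)  # first row equal to flag_1
last_flag_1_index = bisect_right(keys, flag_1)  # one past the last such row
flag_set_1 = list(range(first_flag_1_index, last_flag_1_index))
Note that bisect_right already returns one past the last occurrence, so no +1 is needed here.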
If you know the frequency of the flags, you can order the if... elif... elif... elif... else chain to check the most frequent flag first, then the second most frequent, and so on.
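If the frequencies are unknown, a quick pass with collections.Counter over a sample can suggest the order (a sketch; the 10,000-row sample size is arbitrary):
from collections import Counter

counts = Counter(row[0] for row in data_file[:10000])
print(counts.most_common())     # flags listed from most to least frequent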
You can also use a dict to get rid of the if... elif... sequence entirely. If you don't have too many rows that fail to match any flag, you can use a defaultdict:
from collections import defaultdict

def test_append_default_dict():
    flag_set = defaultdict(list)            # one list per distinct row[0]
    for i, row in enumerate(data_file):
        flag_set[row[0]].append(i)          # no branching at all
    return tuple(flag_set[f] for f in (flag_1, flag_2, flag_3, flag_4))
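Conversely, if many rows match none of the flags, a plain pre-built dict with .get() skips them without creating lists for unwanted keys (a sketch; test_append_dict is an illustrative name):
def test_append_dict():
    flag_set = {flag_1: [], flag_2: [], flag_3: [], flag_4: []}
    for i, row in enumerate(data_file):
        lst = flag_set.get(row[0])          # None for non-flag rows
        if lst is not None:
            lst.append(i)
    return tuple(flag_set[f] for f in (flag_1, flag_2, flag_3, flag_4))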
Benchmarks with the data above:
test_list_comprehensions 3.8617278739984613
test_append 1.9978336450003553
test_append_default_dict 1.4595633919998363