
Optimizing execution time when checking whether the chars of a word are in a list (Python)

I am writing Python 2.7.15 code that accesses the chars inside a word. How can I optimize this process so that I can also check whether each char of every word is contained in an external list?

I have tried two versions of the Python 2 code: version (1) is an expanded form of what my code has to do, whereas version (2) is a compact form of the same code.

chars_array = ['a','b','c']
VERSION (1)
def version1(word):
    chars =[x for x in word]
    count = 0

    for c in chars:
        if not c in chars_array:
            count+=1

    return count
VERSION (2)
def version2(word):
    return sum([1 for c in [x for x in word] if not c in chars_array])

I am analyzing a large corpus and for version1 I obtain an execution time of 8.56 sec, whereas for version2 it is 8.12 sec.

The fastest solution (can be up to 100x faster for an extremely long string):

joined = ''.join(chars_array)
def version3(word):
    return len(word.translate(None, joined))
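
Note that the two-argument form translate(None, deletechars) exists only on the Python 2 str type. On Python 3, a comparable approach is to build a deletion table once with str.maketrans. This is only a sketch, assuming word is a str; the names delete_table and version3_py3 are made up for illustration and are not part of the original answer:

```python
chars_array = ['a', 'b', 'c']

# The third argument of str.maketrans lists the characters to delete.
delete_table = str.maketrans('', '', ''.join(chars_array))

def version3_py3(word):
    # Characters found in chars_array are removed; what remains is counted.
    return len(word.translate(delete_table))

print(version3_py3('abcxyz'))  # 3
```

Building the table outside the function keeps the per-word cost down to a single translate call.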

An alternative that runs at approximately the same speed as your code:

from itertools import ifilterfalse
def version4(word):
    return sum(1 for _ in ifilterfalse(set(chars_array).__contains__, word))

Timings (s is a random string):

In [17]: %timeit version1(s)
1000 loops, best of 3: 79.9 µs per loop

In [18]: %timeit version2(s)
10000 loops, best of 3: 98.1 µs per loop

In [19]: %timeit version3(s)
100000 loops, best of 3: 4.12 µs per loop # <- fastest

In [20]: %timeit version4(s)
10000 loops, best of 3: 84.3 µs per loop

With chars_array = ['a', 'e', 'i', 'o', 'u', 'y'] and words equal to a list of 56048 English words, I measured a number of variants with a command similar to the following at an IPython prompt:

%timeit n = [version1(word) for word in words]

In each case it reported "10 loops, best of 3", and I have shown the time per loop in comments next to each function definition below:

# OP's originals:

def version1(word):  # 163 ms
    chars =[x for x in word]
    count = 0
    for c in chars:
        if not c in chars_array:
            count+=1
    return count

def version2(word):  # 173 ms
    return sum([1 for c in [x for x in word] if not c in chars_array])

Now let's hit version1 and version2 with three optimizations:

  • remove the redundant list comprehension and iterate through word directly instead;
  • use the operator not in rather than negating the result of the in operator;
  • check for (non-)membership of a set rather than a list.


chars_set = set(chars_array)

def version1a(word):  # 95.5 ms
    count = 0
    for c in word:
        if c not in chars_set:
            count+=1
    return count

def version2a(word):  # 104 ms
    return sum([1 for c in word if c not in chars_set])

So there's actually an advantage for the multi-line code over the list comprehension. This may depend on word length, though: version2a has to allocate a new list the same length as the word, whereas version1a does not. Let's refine version2a further to give it that same advantage, by summing over a generator expression rather than a list comprehension:

def version2b(word):  # 111 ms
    return sum(1 for c in word if c not in chars_set)

To my surprise that was actually slightly counterproductive, but again, that effect may depend on word length.
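
One way to probe the word-length effect is to time both variants on synthetic words of different sizes with the standard timeit module. The following is a rough sketch, not the measurement used above: the compare helper, the chosen lengths, and the synthetic words are all assumptions for illustration, and the timings will vary by machine and interpreter.

```python
import timeit

chars_set = set(['a', 'e', 'i', 'o', 'u', 'y'])

def version2a(word):
    # List comprehension: builds a temporary list before summing.
    return sum([1 for c in word if c not in chars_set])

def version2b(word):
    # Generator expression: no temporary list, values are summed lazily.
    return sum(1 for c in word if c not in chars_set)

def compare(lengths=(5, 50, 500), number=200):
    # Hypothetical helper: maps word length -> (seconds for 2a, seconds for 2b).
    results = {}
    for n in lengths:
        word = ('python' * n)[:n]  # synthetic word of exactly n chars
        t_a = timeit.timeit(lambda: version2a(word), number=number)
        t_b = timeit.timeit(lambda: version2b(word), number=number)
        results[n] = (t_a, t_b)
    return results

if __name__ == '__main__':
    for n, (t_a, t_b) in sorted(compare().items()):
        print(n, t_a, t_b)
```

Both functions return the same count for any input, so any difference in the reported times reflects only how the count is produced.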

Finally, let's experience the power of .translate():

chars_str = ''.join(chars_set)

def version3(word):  # 40.7 ms
    return len(word.translate(None, chars_str))

We have a clear winner.
