
Python: Fastest way to find a string in an enumeration

I parsed the IANA subtag registry (see Cascaded string split, pythonic way) and built a list of 8,600 tags:

tags= ['aa',
       'ab',
       'ae',
       'af',
       'ak',
       'am',
       'an',
       'ar',
       # ...

I want to check whether, for example, mytag = "ro" is in the list. What is the fastest way to do that?

First solution:

if mytag in tags:
    print("found")

Second solution:

if mytag in Set(tags):
    print("found")

Third solution: join the list into one big string like '|aa|ab|ae|af|ak|am|an|ar|...' and then test for substring membership:

tags = '|aa|ab|ae|af|ak|am|an|ar|...'
if mytag in tags:
    print("found")

Is there another way? Which is the fastest, and has this already been measured? If not, how can I benchmark it myself? Should I search for a random element from the list, or for the last one? And can someone provide Python code for a 'chronometer'?

As I don't have access to the original data, any test I ran would be biased. However, you asked for a 'chronometer': check the timeit module, which is designed to time small code snippets.

Note that if you use IPython, %timeit is a magic function that makes it a breeze to time the execution of an expression, as illustrated below.

Some comments:

  • you should replace Set with the built-in set
  • construct your set and long string before running any tests
  • taking a random element from your tags list is indeed the way to go

As an example of use of %timeit in IPython:

tags = ['aa','ab','ae','af','ak','an','ar']
tags_set = set(tags)
tags_str = "|".join(tags)

%timeit 'ro' in tags
1000000 loops, best of 3: 223 ns per loop
%timeit 'ro' in tags_set
1000000 loops, best of 3: 73.5 ns per loop
%timeit 'ro' in tags_str
1000000 loops, best of 3: 98.1 ns per loop
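Outside IPython, the plain timeit API can take the same measurements; a minimal sketch (the iteration count of 100000 is an arbitrary choice, and the data structures are built once in the untimed setup string):

```python
import timeit

# Build each data structure once, in the (untimed) setup string
setup = """
tags = ['aa', 'ab', 'ae', 'af', 'ak', 'an', 'ar']
tags_set = set(tags)
tags_str = '|'.join(tags)
"""

# timeit.timeit returns the total time in seconds for `number` runs
t_list = timeit.timeit("'ro' in tags", setup=setup, number=100000)
t_set = timeit.timeit("'ro' in tags_set", setup=setup, number=100000)
t_str = timeit.timeit("'ro' in tags_str", setup=setup, number=100000)

print(t_list, t_set, t_str)
```

Dividing each total by the iteration count gives the per-lookup figures comparable to the %timeit output above.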

Not related to timings or performance, but you may be able to avoid worrying about this kind of thing by structuring the data differently.

Looking at your previous post, the answer you accepted contained a function iana_parse that yielded dicts. So, if you know what you're looking for before parsing, you could do:

looking_for = {'ro', 'xx', 'yy', 'zz'}
for res in iana_parse(data):  # from previous post
    if res['Subtag'] in looking_for:
        print(res['Subtag'], 'was found')

Otherwise (or in combination with), you could build a dict from that function and use that:

subtag_lookup = {rec['Subtag']:rec for rec in iana_parse(data)}

ro = subtag_lookup['ro']
print(ro['Description'])

If at some point you just want a list of subtags, use:

subtags = list(subtag_lookup)
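Since iana_parse from the previous post isn't reproduced here, the sketch below uses two hand-written records to stand in for its output; the point is that membership tests on the dict's keys are constant-time, just like a set:

```python
# Hypothetical records standing in for iana_parse(data) output
records = [
    {'Subtag': 'ro', 'Description': 'Romanian'},
    {'Subtag': 'aa', 'Description': 'Afar'},
]

subtag_lookup = {rec['Subtag']: rec for rec in records}

print('ro' in subtag_lookup)               # dict key lookup, O(1) on average
print(subtag_lookup['ro']['Description'])  # full record is still at hand

subtags = list(subtag_lookup)  # keys, in insertion order
print(subtags)
```

So the dict gives you the fast membership test and keeps the rest of each record available in one structure.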

You can check this yourself: just use the timeit module.

timeit.Timer() might be useful for you.

Alternatively, you can use the time module:

import time
ct = time.perf_counter()  # time.clock() was removed in Python 3.8
if mytag in tags:
    print("found")
print("diff:", time.perf_counter() - ct)

I prefer #1. It should also give you the best performance of the choices you presented, since you aren't doing any additional processing on the list before the comparison.

As for how to test performance... timeit is what you want.

import timeit
s1 = """
tags= ['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar']
mytag = 'ro'
if mytag in tags:
    print('found')
"""
s2 = """
tags= ['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar']
mytag = 'ro'
if mytag in set(tags):
    print('found')
"""
s3 = """
tags= ['aa', 'ab', 'ae', 'af', 'ak', 'am', 'an', 'ar']
mytag = 'ro'
if mytag in '|'.join(tags):
    print('found')
"""

print(timeit.Timer(s1, 'gc.enable()').timeit())
print(timeit.Timer(s2, 'gc.enable()').timeit())
print(timeit.Timer(s3, 'gc.enable()').timeit())

>>> 
0.261634511713
0.476344575019
0.282574283666
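Note that in s2 the set(tags) is rebuilt inside the timed statement on every iteration, which is largely what is being measured. Moving the construction into timeit's setup argument (which is not timed) isolates the lookup itself; a sketch:

```python
import timeit

setup = "tags = ['aa','ab','ae','af','ak','am','an','ar']; tags_set = set(tags)"

# set built inside the timed statement (as in s2) vs. prebuilt in setup
with_build = timeit.timeit("'ro' in set(tags)", setup=setup, number=100000)
prebuilt = timeit.timeit("'ro' in tags_set", setup=setup, number=100000)

print(with_build, prebuilt)
```

With the construction excluded, the set lookup should come out well ahead of the other two variants.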

I've run the tests myself using the code below (you can use %cpaste in the IPython console and paste it in):

# Get IANA language definitions
import random
import urllib.request

f = urllib.request.urlopen("http://www.iana.org/assignments/language-subtag-registry")
lan = f.read().decode("utf-8")

def iana_parse(data):
    for record in data.split("%%\n"):
        # skip empty records at file endings
        if not record.strip():
            continue
        rec_data = {}
        for line in record.split("\n"):
            # line.split(":") fails on values that contain ':', so use partition
            key, _, value = line.partition(":")
            rec_data[key.strip()] = value.strip()
        yield rec_data

tags = []
for k in iana_parse(lan):
    if "Subtag" in k:
        tags.append(k["Subtag"])
# maybe store it in a shelve: http://docs.python.org/library/shelve.html

tags_set = set(tags)
tags_str = "|".join(tags)
print("Search 'ro'")
print("List:")
%timeit 'ro' in tags
print("Set:")
%timeit 'ro' in tags_set
print("String:")
%timeit 'ro' in tags_str

random_tag = random.choice(tags)
print("Search random")
print("List:")
%timeit random_tag in tags
print("Set:")
%timeit random_tag in tags_set
print("String:")
%timeit random_tag in tags_str

The results are:

Search 'ro'
List: 1000000 loops, best of 3: 1.61 us per loop
Set: 10000000 loops, best of 3: 45.2 ns per loop
String: 1000000 loops, best of 3: 239 ns per loop

Search random
List: 10000 loops, best of 3: 36.2 us per loop
Set: 10000000 loops, best of 3: 50.9 ns per loop
String: 100000 loops, best of 3: 4.88 us per loop

So the order is:

  1. The set is fastest, provided it is built from the list beforehand and its construction is not included in the measurement.
  2. The string solution comes second, again with the join excluded from the measurement.
  3. Surprisingly, the list comes last.

Option #1 should be the fastest for one-off use, since it doesn't even need to traverse the whole list (building a set requires a full pass over the list), while #2 will be the fastest on every subsequent run (if you build the set() only once), because lookups then take small, constant time.
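That trade-off can be sketched with timeit on a synthetic list of comparable size (8,600 made-up tags, searching for an absent tag so the list scan is worst-case):

```python
import timeit

setup = "tags = ['t%04d' % i for i in range(8600)]"

# A single query: one list scan beats building a set first
one_shot_list = timeit.timeit("'t9999' in tags", setup=setup, number=1000)
one_shot_set = timeit.timeit("'t9999' in set(tags)", setup=setup, number=1000)

# Repeated queries: build the set once (untimed here), then each lookup is O(1)
repeated_set = timeit.timeit("'t9999' in tags_set",
                             setup=setup + "; tags_set = set(tags)",
                             number=1000)

print(one_shot_list, one_shot_set, repeated_set)
```

The list scan wins for a single lookup, while the prebuilt set wins by orders of magnitude once the construction cost is amortized over many lookups.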
