Fastest way in Python to find a 'startswith' substring in a long sorted list of strings

I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.

I am writing an implementation of the game Ghost for MIT Introduction to Programming, assignment 5.

As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").

Update: I could use something that iterated through the list each time, such as Peter's simple suggestion:

def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)

I previously had:

wordlist = [w for w in wordlist if w.startswith(word_fragment)]

(from here) to narrow the list down to the list of valid words that start with that fragment, and consider it a loss if wordlist is empty. The reason I took this approach was that I (incorrectly, see below) thought it would save time, as subsequent lookups would only have to search a smaller list.

It occurred to me that this is going through each item in the original wordlist (38,000-odd words) checking the start of each. This seems silly when wordlist is ordered, and the comprehension could stop once it hits something that is after the word fragment. I tried this:

newlist = []
for w in wordlist:
    if w[:len(word_fragment)] > word_fragment:
        # Take advantage of the fact that the list is sorted
        break
    if w.startswith(word_fragment):
        newlist.append(w)
return newlist

but that is about the same speed, which I thought might be because list comprehensions run as compiled code.

I then thought that more efficient again would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?

Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.

UPDATE:

I have since tested the suggestions below with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing-list approach would win over a series of fragments. In fact, it did not:

The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows (times in seconds):
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452

The overhead of creating a second list etc. clearly took more time than searching the longer list. In hindsight, searching the full list each time was likely the best approach regardless, as the "reducing list" approach increased the time for the first run, which was the worst-case scenario.

Thanks all for some excellent suggestions, and well done Peter for the best answer!!!

Generator expressions are evaluated lazily, so if you only need to determine whether or not your word is valid, I would expect the following to be more efficient, since it doesn't force the full list to be built: it can stop as soon as it finds a match:

def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)

Note that the lack of square brackets is important for this to work.
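
To illustrate why (the sample words here are mine): without the brackets, any() consumes the generator lazily and stops at the first match; with them, the full list of booleans is built before any() ever sees it.

words = ["aardvark", "aback", "abacus", "zebra"]

# Generator expression: any() stops as soon as "aback".startswith("ab") is True,
# so "abacus" and "zebra" are never examined.
any(w.startswith("ab") for w in words)

# List comprehension: every word is checked up front to build the list,
# even though the answer is already known after the second word.
any([w.startswith("ab") for w in words])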

However this is obviously still linear in the worst case. You're correct that binary search would be more efficient; you can use the built-in bisect module for that. It might look something like this:

from bisect import bisect_left
def word_exists(wordlist, word_fragment):
    try:
        return wordlist[bisect_left(wordlist, word_fragment)].startswith(word_fragment)
    except IndexError:
        return False # word_fragment is greater than all entries in wordlist

bisect_left runs in O(log(n)) so is going to be considerably faster for a large wordlist.
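
A quick usage sketch of word_exists above (the sample word list is mine); note that bisect assumes the list is already sorted:

words = sorted(["apple", "banana", "bandana", "cherry"])

word_exists(words, "ban")  # True  -- bisect_left lands on "banana"
word_exists(words, "bz")   # False -- bisect_left lands on "cherry"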

Edit: I would guess that the example you gave loses out if your word_fragment is something really common (like 't'), in which case it probably spends most of its time assembling a large list of valid words, and the gain from only having to do a partial scan of the list is negligible. Hard to say for sure, but it's a little academic since binary search is better anyway.

You're right that you can do this more efficiently given that the list is sorted.

I'm building off of @Peter's answer, which returns a single element. I see that you want all the words that start with a given prefix. Here's how you do that:

from bisect import bisect_left
wordlist[bisect_left(wordlist, word_fragment):
         bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1])+1))]

This returns the slice from your original sorted list.
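
A minimal wrapper around that slice (the function name is mine; it assumes a non-empty ASCII fragment), with the two bounds spelled out:

from bisect import bisect_left

def words_with_prefix(wordlist, word_fragment):
    # Lower bound: the first position where the fragment itself could be inserted.
    lo = bisect_left(wordlist, word_fragment)
    # Upper bound: bump the fragment's last character to get the smallest string
    # that sorts after every word sharing the prefix.
    upper = word_fragment[:-1] + chr(ord(word_fragment[-1]) + 1)
    hi = bisect_left(wordlist, upper)
    return wordlist[lo:hi]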

As Peter suggested, I would use the bisect module, especially if you're reading from a large file of words.

If you really need speed, you could make a daemon (How do you create a daemon in Python?) that has a pre-processed data structure suited for the task.

I suggest you could use "tries"

http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=usingTries

There are many algorithms and data structures for indexing and searching strings inside a text; some of them are included in the standard libraries, but not all of them. The trie data structure is a good example of one that isn't.

Let word be a single string and let dictionary be a large set of words. If we have a dictionary and need to know whether a single word is inside it, a trie is a data structure that can help us. But you may be asking yourself, "Why use tries if a set or a hash table can do the same?" There are two main reasons:

  • Tries can insert and find strings in O(L) time (where L represents the length of a single word). This is much faster than a set, but it is a bit faster than a hash table.
  • A set or a hash table can only find words in a dictionary that match the single word exactly; a trie allows us to find words that differ by a single character, share a common prefix, have a missing character, etc.

Tries can be useful in TopCoder problems, but they also have a great many applications in software engineering. For example, consider a web browser: do you know how the browser can autocomplete your text or show you many possibilities for the text you could be writing? Yes, with a trie you can do it very fast. Do you know how a spell checker can verify that every word you type is in a dictionary? Again, a trie. You can also use a trie to suggest corrections for words that are present in the text but not in the dictionary.

An example would be:

start={'a':nodea,'b':nodeb,'c':nodec...}
nodea={'a':nodeaa,'b':nodeab,'c':nodeac...}
nodeb={'a':nodeba,'b':nodebb,'c':nodebc...}
etc..

Then, if you want all the words starting with "ab", you would just traverse start['a']['b'], and that would give you all the words you want.

To build it, you could iterate through your wordlist and, for each word, iterate through its characters, adding a new default dict where required.
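
A rough sketch of that idea using nested dicts (all the names here are mine, and I use setdefault rather than a defaultdict): build the trie once, then walking the fragment's characters tells you whether any word starts with it.

def build_trie(wordlist):
    trie = {}
    for word in wordlist:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})  # add a new dict where required
        node['$'] = True  # mark the end of a complete word
    return trie

def is_valid_prefix(trie, word_fragment):
    # Follow the fragment character by character; if we fall off the trie,
    # no word starts with this fragment.
    node = trie
    for ch in word_fragment:
        if ch not in node:
            return False
        node = node[ch]
    return True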

In case of binary search (assuming wordlist is sorted), I'm thinking of something like this:

wordlist = "ab", "abc", "bc", "bcf", "bct", "cft", "k", "l", "m"
fragment = "bc"
a, m, b = 0, 0, len(wordlist)-1
iterations = 0

while True:
    if (a + b) / 2 == m: break # endless loop = nothing found
    m = (a + b) / 2
    iterations += 1
    if wordlist[m].startswith(fragment): break # found word
    if wordlist[m] > fragment >= wordlist[a]: a, b = a, m
    elif wordlist[b] >= fragment >= wordlist[m]: a, b = m, b

if wordlist[m].startswith(fragment):
    print wordlist[m], iterations
else:
    print "Not found", iterations

It will find one matched word, or none. You will then have to look to the left and right of it to find the other matched words. My algorithm might be incorrect; it's just a rough version of my thoughts.
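
A sketch of that expansion step (my addition, reusing the names from the code above): once wordlist[m] matches, walk outwards in both directions until the prefix stops matching.

# Assumes the loop above ended with wordlist[m].startswith(fragment) being True.
lo = m
while lo > 0 and wordlist[lo - 1].startswith(fragment):
    lo -= 1
hi = m
while hi < len(wordlist) - 1 and wordlist[hi + 1].startswith(fragment):
    hi += 1
matches = wordlist[lo:hi + 1]  # every word in the block sharing the prefix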

Here's my fastest way to narrow the list wordlist down to a list of valid words starting with a given fragment:

sect() is a generator function that uses Peter's excellent idea of employing bisect, together with the islice() function:

from bisect import bisect_left
from itertools import islice

from time import clock

A,B = [],[]


iterations = 5
repetition = 10

with open('words.txt') as f:
    wordlist = f.read().split()

wordlist.sort()
print 'wordlist[0:10]==',wordlist[0:10]


def sect(wordlist,word_fragment):
    lgth = len(word_fragment)
    for w in islice(wordlist,bisect_left(wordlist, word_fragment),None):
        if w[0:lgth]==word_fragment:
            yield w
        else:
            break


def hooloo(wordlist,word_fragment):
    usque = len(word_fragment)
    for w in wordlist:
        if w[:usque] > word_fragment:
            break
        if w.startswith(word_fragment):
            yield w


for rep in xrange(repetition):
    te = clock()
    for i in xrange(iterations):
        newlistA = list(sect(wordlist,'VEST'))
    A.append(clock()-te)

    te = clock()
    for i in xrange(iterations):
        newlistB = list(hooloo(wordlist,'VEST'))
    B.append(clock() - te)


print '\niterations =',iterations,'   number of tries:',repetition,'\n'
print newlistA,'\n',min(A),'\n'
print newlistB,'\n',min(B),'\n'

Result:

wordlist[0:10]== ['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS', 'AAL', 'AALII', 'AALIIS', 'AALS', 'AARDVARK']

iterations = 5    number of tries: 30 

['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES'] 
0.0286089433154 

['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES'] 
0.415578236899

sect() is 14.5 times faster than hooloo()

PS:

I know that timeit exists, but here, for a result like this, clock() is fully sufficient.

Doing a binary search in the list is not going to guarantee you anything by itself; I am not sure how that would work either.

You have a list which is ordered, which is good news. The algorithmic complexity of both your cases is O(n), which is not bad: you just have to iterate through the whole wordlist once.

But in the second case, the practical (engineering) performance should be better, because you break as soon as you know the remaining items cannot match. Try a list where the first element is a match and the remaining 38000 - 1 elements are not; you will see the second beat the first.
