Trie Backtracking in Recursion

I am building a trie for a spell checker with suggestions. Each node contains a key (a letter) and a value (an array of the letters that follow it down that path).

So assume the following sub-trie in my big trie:

                W
               / \
              a   e
              |   |
              k   k
              |   |
   is word--> e   e
                  |
                 ...

This is just one part of a sub-trie. W is a node, a and e are two nodes in its value array, and so on.
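A minimal sketch of the node layout described above may make the later code easier to follow. The class name `Node` and the `insert` method are assumptions for illustration; only `values` and `isWord` appear in the question's code, and the made-up entry "wekex" merely stands in for the longer branch drawn as "..." in the diagram.

```python
class Node:
    """One trie node: a letter plus its children, keyed by letter."""

    def __init__(self, key=""):
        self.key = key        # the letter stored at this node
        self.values = {}      # child letter -> child Node
        self.isWord = False   # True if the path from the root spells a word

    def insert(self, word):
        """Add word below this node, creating child nodes as needed."""
        node = self
        for letter in word:
            if letter not in node.values:
                node.values[letter] = Node(letter)
            node = node.values[letter]
        node.isWord = True


root = Node()
# "wekex" is a made-up entry so that the path w-e-k-e exists without being a word
for w in ("wake", "wekex"):
    root.insert(w)

w_node = root.values["w"]
print(sorted(w_node.values))  # the two branches under w
```

With this layout, "is word" in the diagram corresponds to `isWord` being set on the final e of the w-a-k-e path.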

At each node, I check if the next letter in the word is a value of the node. I am trying to support mistyped vowels for now. So 'weke' will yield 'wake' as a suggestion. Here's my searchWord function in my trie:

def searchWord(self, word, path=""):

    if len(word) > 0:
        key = word[0]
        word = word[1:]

        if key in self.values:
            path = path + key
            nextNode = self.values[key]
            return nextNode.searchWord(word, path)

        else:
            # check here if key is a vowel. If it is, check for other vowel substitutes
            return None  # placeholder until the vowel check is written

    else:
        if self.isWord:
            return path # this is the word found
        else:
            return None

Given 'weke', at the end, when word is of length zero and path is 'weke', my code will hit the second big else block. weke is not marked as a word, so that call returns None, and the None propagates all the way out of searchWord.

To avoid this, at each stack unwind or recursion backtrack, I need to check if a letter is a vowel and if it is, do the checking again.

I changed the block that handles a matching key to the following:

if key in self.values:
    path = path + key
    nextNode = self.values[key]
    ret = nextNode.searchWord(word, path)

    if ret is None:
        # check if key == vowel and replace path
        # return nextNode.searchWord(...
        pass

    return ret

What am I doing wrong here? What can I do when backtracking to achieve what I'm trying to do?
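One possible way to get the backtracking behaviour, sketched under assumptions rather than taken from the original code: at each node, try the typed letter first and, if that subtree returns None and the letter is a vowel, try the other vowels before giving up. The `Node` class, its `insert` method, the `VOWELS` constant, and the made-up entry "wekex" are all hypothetical; only `values`, `isWord`, and `searchWord` come from the question.

```python
VOWELS = "aeiou"


class Node:
    def __init__(self):
        self.values = {}      # child letter -> child Node
        self.isWord = False   # True if the path to this node is a word

    def insert(self, word):
        node = self
        for letter in word:
            node = node.values.setdefault(letter, Node())
        node.isWord = True

    def searchWord(self, word, path=""):
        # Out of letters: succeed only if this node ends a word.
        if not word:
            return path if self.isWord else None

        key, rest = word[0], word[1:]

        # Try the typed letter first, then (for vowels) every other vowel.
        candidates = [key]
        if key in VOWELS:
            candidates += [v for v in VOWELS if v != key]

        for letter in candidates:
            child = self.values.get(letter)
            if child is None:
                continue
            # path itself is never overwritten, so a failed branch
            # leaves it intact for the next candidate.
            found = child.searchWord(rest, path + letter)
            if found is not None:
                return found
            # None means this branch was a dead end: try the next candidate.

        return None


root = Node()
# "wekex" is a made-up entry so that the path w-e-k-e exists but is not a word
for w in ("wake", "wekex"):
    root.insert(w)

print(root.searchWord("weke"))  # wake
```

Compared with the attempt above, the two differences are that the alternatives are tried in a loop instead of after a single failed call, and that `path + letter` is passed down without reassigning `path`, so there is nothing to undo when a branch fails.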

Search recursively. Keep track of the current index and the original word.

letters = [chr(i) for i in range(97, 97 + 26)]  # 'a' through 'z'
print(letters)
max_calls = 300  # upper bound on the number of recursive calls

def searchWord(orig, word, curindex, counter):
    if counter > max_calls:
        return

    if counter == 0:
        # start by putting 'a' in the first position
        s = letters[0] + word[1:]
        searchWord(orig, s, 0, counter + 1)
    else:
        c = word[curindex]

        print('checking', word, curindex)
        i = letters.index(c)

        if i == len(letters) - 1 and curindex == len(orig) - 1:
            print('done')
            return

        if i == len(letters) - 1:
            # reached 'z' at this position: restore the original letter
            # and start with 'a' at the next position
            print('end of letters reached')
            print('curindex', curindex)
            s = list(word)
            s[curindex] = orig[curindex]
            s[curindex + 1] = letters[0]
            s = ''.join(s)
            searchWord(orig, s, curindex + 1, counter + 1)

        else:
            # advance the letter at curindex to the next one in the alphabet
            s = list(word)
            s[curindex] = letters[i + 1]
            s = ''.join(s)
            searchWord(orig, s, curindex, counter + 1)


searchWord("weke", "weke", 0, 0)

I'm not sure recursion and tree search are the right approach here. If you have a table of words in memory, the lookup will be very fast. It is only when the search space is very big that one has to split the problem. So a better algorithm is probably something simple like this:

corpus_words = {'wake', ...}  # this is in memory
allowed = word in corpus_words  # perhaps improve this with adjusted binary search

A typical corpus has 5-30 million words, which is less than 1 gigabyte. Lookup will be very fast: a hash set gives constant-time membership tests on average, and binary search over a sorted list is O(log n). The problem with searching for a subset of the word is that you don't know that the typed word is not a word. However, you could build a list of allowed vowels, since certain combinations of letters won't be in the corpus. So in terms of computation this problem is pretty easy nowadays. Of course, one can quickly improve on the simple lookup by keeping a core corpus in memory and the rest on disk. Swype on Android works pretty well. It uses a personalized corpus and some machine learning.
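The two lookup strategies mentioned here, a membership test on an in-memory set and a binary search over a sorted list, can be compared in a small sketch. The three-word corpus is a toy stand-in for a real word list, and the helper name `in_sorted` is an assumption for illustration; `bisect.bisect_left` is from the Python standard library.

```python
import bisect

corpus_words = {"wake", "week", "weekend"}  # toy stand-in for a real corpus
sorted_words = sorted(corpus_words)         # binary search needs sorted input


def in_sorted(words, word):
    """Membership test by binary search on a sorted list."""
    i = bisect.bisect_left(words, word)
    return i < len(words) and words[i] == word


for candidate in ("wake", "weke"):
    # both tests should agree; the set lookup is a hash lookup
    print(candidate, candidate in corpus_words, in_sorted(sorted_words, candidate))
```

A sorted list keeps memory overhead lower than a set, which may matter for a corpus of millions of words, while the set is the simpler choice when memory is not a concern.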

What I would do to solve this particular problem is to calculate the neighbours of the word 'weke' and check whether they are in the corpus, i.e.

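word = 'weke'
corpus = {'wake'}  # example in-memory set of known words
suggestions = list()
letters = [chr(x) for x in range(97, 97 + 26)]
for i in range(len(word)):
    for a in letters:  # or do this in a smarter way to iterate
        # strings are immutable, so build the neighbour by slicing
        newword = word[:i] + a + word[i + 1:]
        if newword in corpus:
            suggestions.append(newword)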

Then, to improve it, check whether subsections of the word are in a corpus of syllables. A lot of work has been done on this front, so you can probably find standard solutions on the internet, for example: http://nltk.org/
