
Storing word count in the python trie

I took a list of words and put them into a trie. I would also like to store a count for each word for further analysis. What would be the best way to do that? Below are the classes where I think the frequency should be collected and stored, but I am not sure how to go about it. You can see my attempt in insert: the curr.v += 1 line is where I try to store the count.

class TrieNode:
    def __init__(self, k):
        self.v = 0          # count I would like to store
        self.k = k          # letter held by this node
        self.end = False    # True when a word ends at this node
        self.children = {}
    def all_words(self, prefix):
        if self.end:
            yield prefix
        for letter, child in self.children.items():
            yield from child.all_words(prefix + letter)
class Trie:
    def __init__(self):
        self.root = TrieNode('')

    def insert(self, word):
        curr = self.root
        for letter in word:
            node = curr.children.get(letter)
            if not node:
                node = TrieNode(letter)
                curr.children[letter] = node
            curr = node
            curr.v += 1  # my attempt at storing the count
        curr.end = True

    def insert_many(self, words):
        for word in words:
            self.insert(word)
    def all_words_beginning_with_prefix(self, prefix):
        cur = self.root
        for c in prefix:
            cur = cur.children.get(c)
            if cur is None:
                return  # No words with given prefix
        yield from cur.all_words(prefix)


I want to store the count so that when I use

print(list(trie.all_words_beginning_with_prefix('prefix')))

I would get a result like so:

[(word, count), (word, count)]

While inserting, every node you visit lies on the path of the word being added, so increment that node's word_count. Each node's word_count then tells you how many inserted words share the prefix that ends at that node.

class TrieNode:
    def __init__(self, char):
        self.char = char
        self.word_count = 0
        self.children = {}

    def all_words(self, prefix, path):
        if len(self.children) == 0:
            yield prefix + path
        for letter, child in self.children.items():
            yield from child.all_words(prefix, path + letter)


class Trie:
    def __init__(self):
        self.root = TrieNode('')

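    def insert(self, word):
        curr = self.root
        curr.word_count += 1  # every inserted word passes through the root
        for letter in word:
            node = curr.children.get(letter)
            if node is None:
                node = TrieNode(letter)
                curr.children[letter] = node
            curr = node
            # Count the word on every node along its path, including the last
            # one, so that a complete word also counts for its own prefix.
            curr.word_count += 1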

    def insert_many(self, words):
        for word in words:
            self.insert(word)

    def all_words_beginning_with_prefix(self, prefix):
        cur = self.root
        for c in prefix:
            cur = cur.children.get(c)
            if cur is None:
                return  # No words with given prefix
        yield from cur.all_words(prefix, path="")

    def word_count(self, prefix):
        cur = self.root
        for c in prefix:
            cur = cur.children.get(c)
            if cur is None:
                return 0
        return cur.word_count


trie = Trie()
trie.insert_many(["hello", "hi", "random", "heap"])

prefix = "he"
words = [w for w in trie.all_words_beginning_with_prefix(prefix)]

print("Lazy method:\n Prefix: %s, Words: %s, Count: %d" % (prefix, words, len(words)))
print("Proactive method:\n Word count for '%s': %d" % (prefix, trie.word_count(prefix)))

Output:

Lazy method:
 Prefix: he, Words: ['hello', 'heap'], Count: 2
Proactive method:
 Word count for 'he': 2

I would add a field called is_word to the trie node, which is true only for the node holding the last letter of a word. For example, for the word AND, is_word would be true on the node holding the letter D. I would then update the frequency only on nodes where is_word is true, not on every letter in the word.

So when you iterate from a letter, check whether the node is a word; if it is, return the word together with its count, then keep going into its children, since longer words can share the same prefix. I'm assuming that during the iteration you keep track of the letters and keep appending them to the prefix.
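
The is_word idea above could be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the attribute name count and the sample word list are assumptions made for the example, and the class layout simply mirrors the question's. It yields (word, count) pairs, which is the output shape the question asks for. Unlike a leaf test such as len(self.children) == 0, the flag also reports a word like "he" that is itself a prefix of a longer word.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True only on the node holding a word's last letter
        self.count = 0        # assumed name: how often that exact word was inserted

    def all_words(self, prefix):
        # Yield (word, count) for every complete word at or below this node.
        if self.is_word:
            yield prefix, self.count
        for letter, child in self.children.items():
            yield from child.all_words(prefix + letter)


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for letter in word:
            child = node.children.get(letter)
            if child is None:
                child = TrieNode()
                node.children[letter] = child
            node = child
        node.is_word = True
        node.count += 1  # frequency lives on the final node only

    def all_words_beginning_with_prefix(self, prefix):
        node = self.root
        for letter in prefix:
            node = node.children.get(letter)
            if node is None:
                return  # no words with the given prefix
        yield from node.all_words(prefix)


trie = Trie()
for word in ["he", "hello", "hi", "heap", "hello"]:  # sample data, "hello" twice
    trie.insert(word)

result = list(trie.all_words_beginning_with_prefix("he"))
print(result)  # [('he', 1), ('hello', 2), ('heap', 1)]
```

Keeping the count on the terminal node means the trie stores per-word frequency; a count of all words under a prefix can still be obtained by summing the counts yielded by the traversal.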

Your trie is a multi-way trie.
