I'm interested in tries and DAWGs (directed acyclic word graphs). I've been reading a lot about them, but I don't understand what the output trie or DAWG file should look like.
I want to understand the best output structure in order to figure out how to create and use one.
I would also appreciate knowing what the output of a DAWG should be, along with that of a trie.
I do not want to see graphical representations with bubbles linked to each other; I want to know what the output object is once a set of words is turned into a trie or a DAWG.
Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:
>>> _end = '_end_'
>>>
>>> def make_trie(*words):
...     root = dict()
...     for word in words:
...         current_dict = root
...         for letter in word:
...             current_dict = current_dict.setdefault(letter, {})
...         current_dict[_end] = _end
...     return root
...
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
             'z': {'_end_': '_end_'}}},
 'f': {'o': {'o': {'_end_': '_end_'}}}}
If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It's like a version of get that also updates the dictionary.)
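To make the difference between get and setdefault concrete, here is a tiny sketch. The variable names (node, child, same_child) are just placeholders for illustration and are not part of the trie code above.

```python
# Demonstrates how dict.setdefault differs from dict.get.
node = {}

# get() returns the default but leaves the dictionary untouched.
assert node.get('a', {}) == {}
assert node == {}

# setdefault() stores the default under the missing key and returns it,
# so we can keep descending into the freshly created child.
child = node.setdefault('a', {})
child['b'] = {}
assert node == {'a': {'b': {}}}

# On an existing key, setdefault() returns the stored value and ignores
# the default that was passed in.
same_child = node.setdefault('a', {})
assert same_child is child
```

This is why a single line inside the loop of make_trie is enough to either follow an existing branch or create a new one.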
Next, a function to test whether the word is in the trie:
>>> def in_trie(trie, word):
...     current_dict = trie
...     for letter in word:
...         if letter not in current_dict:
...             return False
...         current_dict = current_dict[letter]
...     return _end in current_dict
...
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False
I'll leave insertion and removal to you as an exercise.
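If you want something to compare your own attempt against, here is one possible sketch of both operations on the same nested-dictionary layout. The helper names insert_word and remove_word are my own, and the pruning of empty branches is a design choice rather than a requirement.

```python
_end = '_end_'  # same end-of-word marker as in the code above

def insert_word(trie, word):
    """Add one word to an existing nested-dict trie (hypothetical helper)."""
    node = trie
    for letter in word:
        node = node.setdefault(letter, {})
    node[_end] = _end

def remove_word(trie, word):
    """Remove a word and prune branches that no longer lead to any word.

    Returns True if the word was present, False otherwise.
    """
    path = [trie]  # dictionaries visited while walking along the word
    for letter in word:
        if letter not in path[-1]:
            return False
        path.append(path[-1][letter])
    if _end not in path[-1]:
        return False
    del path[-1][_end]
    # Walk back up, deleting child dictionaries that became empty.
    steps = zip(reversed(word), reversed(path[:-1]), reversed(path[1:]))
    for letter, parent, child in steps:
        if child:
            break  # this branch still leads to another word
        del parent[letter]
    return True

trie = {}
for w in ('foo', 'bar', 'baz', 'barz'):
    insert_word(trie, w)
remove_word(trie, 'barz')  # 'bar' and 'baz' are still stored
remove_word(trie, 'foo')   # the whole 'f' branch disappears
print(trie)
```

Without the pruning loop the trie would still answer lookups correctly; it would just keep dead branches around.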
Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.
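A rough sketch of what "just nest the lists" could look like is below. The layout (each node is a list of [key, child] pairs) and the names END, make_list_trie, _child and in_list_trie are assumptions of mine, chosen only to illustrate the linear search.

```python
END = None  # hypothetical end-of-word marker, stored like any other key

def _child(node, key, create=False):
    # Linear scan over at most alphabet-size + 1 entries.
    for k, child in node:
        if k == key:
            return child
    if create:
        child = []
        node.append([key, child])
        return child
    return None

def make_list_trie(words):
    # Each node is a list of [key, child_node] pairs; key is a letter or END.
    root = []
    for word in words:
        node = root
        for letter in word:
            node = _child(node, letter, create=True)
        if _child(node, END) is None:
            node.append([END, []])
    return root

def in_list_trie(root, word):
    node = root
    for letter in word:
        node = _child(node, letter)
        if node is None:
            return False
    return _child(node, END) is not None

t = make_list_trie(['foo', 'bar', 'baz', 'barz'])
print(in_list_trie(t, 'baz'), in_list_trie(t, 'ba'))
```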
Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.
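To give a feel for what a DAWG object might look like, here is a minimal sketch that takes the simplest route I know of: build an ordinary trie first, then merge identical subtrees bottom-up. This is not the incremental construction hinted at above, and the names (Node, build_trie, minimize, contains, count_nodes) are all made up for this example.

```python
class Node:
    def __init__(self):
        self.final = False  # True if a word ends at this node
        self.edges = {}     # letter -> Node

def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(node, registry):
    # Post-order walk: make the children canonical first, then look this
    # node up by (final flag, outgoing edges to canonical children).
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    signature = (node.final,
                 tuple(sorted((ch, id(c)) for ch, c in node.edges.items())))
    # Equivalent nodes collapse onto the first one registered.
    return registry.setdefault(signature, node)

def contains(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final

def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)

words = ['tap', 'taps', 'top', 'tops']
trie = build_trie(words)
trie_size = count_nodes(trie)  # 8 nodes as a plain trie
dawg = minimize(trie, {})      # note: rewires the trie in place
dawg_size = count_nodes(dawg)  # 5 nodes once shared suffixes are merged
print(trie_size, dawg_size)
```

Lookups work exactly as in a trie; the only difference is that several parents may point at the same child object, which is what makes it a graph rather than a tree.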
Have a look at this:
https://github.com/kmike/marisa-trie
Static memory-efficient Trie structures for Python (2.x and 3.x).
String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
Based on marisa-trie C++ library.
Here's a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/
At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.
...
I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What's great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.
What's great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server's memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python's dictionary implementation.
There are also a couple of pure-python implementations, though unless you're on a restricted platform you'd want to use the C++ backed implementation above for best performance:
Here is a list of python packages that implement Trie:
Modified from senderle's method (above). I found that Python's defaultdict is ideal for creating a trie or a prefix tree.
from collections import defaultdict

class Trie:
    """
    Implement a trie with insert, search, and startsWith methods.
    """
    def __init__(self):
        # No default factory is given, so this behaves like a plain dict.
        self.root = defaultdict()

    # @param {string} word
    # @return {void}
    # Inserts a word into the trie.
    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    # @param {string} word
    # @return {boolean}
    # Returns if the word is in the trie.
    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False

    # @param {string} prefix
    # @return {boolean}
    # Returns if there is any word in the trie
    # that starts with the given prefix.
    def startsWith(self, prefix):
        current = self.root
        for letter in prefix:
            if letter not in current:
                return False
            current = current[letter]
        return True

# Now test the class
test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')
print(test.search('hello'))       # False: 'hello' is only a prefix
print(test.startsWith('hello'))   # True
print(test.search('ilikeapple'))  # True
There's no "should"; it's up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion.
I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.
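As a rough illustration of that idea, here is a minimal sketch. The exact node layout ([is_word, children] with children as (letter, index) pairs) and the names nodes, add_word and has_word are only assumptions for the example.

```python
# Hypothetical layout: nodes[i] is [is_word, children], where children is a
# list of (letter, index) pairs pointing back into the same global list.
nodes = [[False, []]]  # index 0 is the root

def add_word(word):
    current = 0
    for letter in word:
        for key, index in nodes[current][1]:
            if key == letter:
                current = index
                break
        else:
            # No child for this letter yet: append a node and link to it.
            nodes.append([False, []])
            nodes[current][1].append((letter, len(nodes) - 1))
            current = len(nodes) - 1
    nodes[current][0] = True

def has_word(word):
    current = 0
    for letter in word:
        for key, index in nodes[current][1]:
            if key == letter:
                current = index
                break
        else:
            return False
    return nodes[current][0]

add_word('cat')
add_word('car')
print(nodes)
print(has_word('cat'), has_word('ca'))
```

One thing this layout has going for it is that the whole structure is a single flat list of small records, which should be easy to write to a file and read back.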
from collections import defaultdict

_trie = lambda: defaultdict(_trie)
trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")

def word_exist(trie, word):
    curr = trie
    for w in word:
        if w not in curr:
            return False
        curr = curr[w]
    return '_end' in curr

print(word_exist(trie, 'cam'))
Here is the full code using a TrieNode class. It also implements an auto_complete method that returns the words matching a prefix.
Since we use a dictionary to store the children, there is no need to convert characters to integers and back, or to allocate array memory in advance.
class TrieNode:
    def __init__(self):
        # Dict: key = letter, value = TrieNode
        self.children = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def build_trie(self, words):
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.end = True

    def search(self, word):
        node = self.root
        for char in word:
            if char in node.children:
                node = node.children[char]
            else:
                return False
        return node.end

    def _walk_trie(self, node, word, word_list):
        # Depth-first walk that collects every complete word below node
        if node.children:
            for char in node.children:
                word_new = word + char
                if node.children[char].end:
                    word_list.append(word_new)
                self._walk_trie(node.children[char], word_new, word_list)

    def auto_complete(self, partial_word):
        node = self.root
        word_list = []
        # find the node for the last char of partial_word
        for char in partial_word:
            if char in node.children:
                node = node.children[char]
            else:
                # partial_word not found, return the empty list
                return word_list
        if node.end:
            word_list.append(partial_word)
        # word_list is filled with the suggestions that start with partial_word
        self._walk_trie(node, partial_word, word_list)
        return word_list
Create a trie:
t = Trie()
words = ['hi', 'height', 'rat', 'ram', 'rattle', 'hill']
t.build_trie(words)
Search for words:
words = ['hi', 'hello']
for word in words:
    print(word, t.search(word))
hi True
hello False
Search for words using a prefix:
partial_word = 'ra'
t.auto_complete(partial_word)
['rat', 'rattle', 'ram']
If you want a TRIE implemented as a Python class, here is something I wrote after reading about them:
class Trie:

    def __init__(self):
        self.__final = False
        self.__nodes = {}

    def __repr__(self):
        return 'Trie<len={}, final={}>'.format(len(self), self.__final)

    def __getstate__(self):
        return self.__final, self.__nodes

    def __setstate__(self, state):
        self.__final, self.__nodes = state

    def __len__(self):
        return len(self.__nodes)

    def __bool__(self):
        return self.__final

    def __contains__(self, array):
        try:
            return self[array]
        except KeyError:
            return False

    def __iter__(self):
        yield self
        for node in self.__nodes.values():
            yield from node

    def __getitem__(self, array):
        return self.__get(array, False)

    def create(self, array):
        self.__get(array, True).__final = True

    def read(self):
        yield from self.__read([])

    def update(self, array):
        self[array].__final = True

    def delete(self, array):
        self[array].__final = False

    def prune(self):
        # Drop child subtrees that no longer hold any word. Returns True if
        # this node is still needed (it ends a word or has children left).
        for key, value in tuple(self.__nodes.items()):
            if not value.prune():
                del self.__nodes[key]
        return self.__final or bool(self.__nodes)

    def __get(self, array, create):
        if array:
            head, *tail = array
            if create and head not in self.__nodes:
                self.__nodes[head] = Trie()
            return self.__nodes[head].__get(tail, create)
        return self

    def __read(self, name):
        if self.__final:
            yield name
        for key, value in self.__nodes.items():
            yield from value.__read(name + [key])
Using defaultdict and the reduce function.
Create the trie:
from functools import reduce
from collections import defaultdict
T = lambda: defaultdict(T)
trie = T()
reduce(dict.__getitem__, 'how', trie)['isEnd'] = True
Resulting trie:
defaultdict(<function __main__.<lambda>()>,
            {'h': defaultdict(<function __main__.<lambda>()>,
                              {'o': defaultdict(<function __main__.<lambda>()>,
                                                {'w': defaultdict(<function __main__.<lambda>()>,
                                                                  {'isEnd': True})})})})
Search in the trie:
curr = trie
for w in 'how':
    if w in curr:
        curr = curr[w]
    else:
        print("Not Found")
        break
else:
    # The loop finished without a break, so every letter was matched.
    # .get() avoids inserting an 'isEnd' key into the defaultdict.
    if curr.get('isEnd'):
        print('Found')
This version uses recursion:
import pprint
from collections import deque

inp = input("Enter a sentence to show as trie\n")
# split() without arguments skips empty strings caused by repeated spaces
words = inp.split()
trie = {}

def trie_recursion(trie_ds, word):
    try:
        letter = word.popleft()
        out = trie_recursion(trie_ds.get(letter, {}), word)
    except IndexError:
        # End of the word
        return {}
    # Don't update if the letter is already present
    if letter not in trie_ds:
        trie_ds[letter] = out
    return trie_ds

for word in words:
    # Go through each word
    trie = trie_recursion(trie, deque(word))
pprint.pprint(trie)
Output:
Coool👾 <algos>🚸 python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
'b': {
'a': {
'r': {},
'z': {}
}
},
'f': {
'o': {
'o': {}
},
'u': {
'n': {}
}
}
}
A trie can store a string in O(L) time, where L is the length of the string, so inserting N strings takes O(NL) time. A string can also be searched for in O(L), and the same goes for deletion.
The code can be cloned from https://github.com/Parikshit22/pytrie.git
class Node:
    def __init__(self):
        # One slot per lowercase letter 'a'..'z'
        self.children = [None]*26
        self.isend = False

class trie:
    def __init__(self,):
        self.__root = Node()

    def __len__(self,):
        return len(self.search_byprefix(''))

    def __str__(self):
        ll = self.search_byprefix('')
        string = ''
        for i in ll:
            string+=i
            string+='\n'
        return string

    def chartoint(self,character):
        return ord(character)-ord('a')

    def remove(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        ptr.isend = False
        return

    def insert(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True

    def search(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True

    def __getall(self,ptr,key,key_list):
        if ptr is None:
            key_list.append(key)
            return
        if ptr.isend==True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i] is not None:
                self.__getall(ptr.children[i],key+chr(ord('a')+i),key_list)

    def search_byprefix(self,key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return None
        self.__getall(ptr,key,key_list)
        return key_list

t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)
True
False
['minakshi', 'minhaj']
7
None
minakshi
pari
parikshit
shubh
shubham
shubhi
This is much like a previous answer but simpler to read:
def make_trie(words):
    trie = {}
    for word in words:
        head = trie
        for char in word:
            if char not in head:
                head[char] = {}
            head = head[char]
        head["_end_"] = "_end_"
    return trie
class Trie:
    def __init__(self):
        # Per-instance root; a class-level dict would be shared by every Trie
        self.head = {}

    def add(self, word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
            cur = cur[ch]
        cur['*'] = True

    def search(self, word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                return False
            cur = cur[ch]
        if '*' in cur:
            return True
        else:
            return False

    def printf(self):
        print(self.head)

dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")
print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()
Out
True
False
False
False
{'h': {'i': {'*': True}}}
Here is @senderle's answer, slightly modified so that a query also matches when a word in the trie is a prefix of it (and not only on whole-word matches):
_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if _end in current_dict:
            return True
        if letter not in current_dict:
            return False
        current_dict = current_dict[letter]
    # All letters consumed: match only if a stored word ends exactly here
    return _end in current_dict

t = make_trie(['hello', 'hi', 'foo', 'bar'])
print(in_trie(t, 'hello world'))
# True
class TrieNode:
    def __init__(self):
        self.keys = {}
        self.end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, node=None) -> None:
        if node is None:
            node = self.root
        # insertion is a recursive operation
        # this is base case to exit the recursion
        if len(word) == 0:
            node.end = True
            return
        # if this key does not exist create a new node
        elif word[0] not in node.keys:
            node.keys[word[0]] = TrieNode()
            self.insert(word[1:], node.keys[word[0]])
        # that means key exists
        else:
            self.insert(word[1:], node.keys[word[0]])

    def search(self, word: str, node=None) -> bool:
        if node is None:
            node = self.root
        # this is positive base case to exit the recursion
        if len(word) == 0 and node.end == True:
            return True
        elif len(word) == 0:
            return False
        elif word[0] not in node.keys:
            return False
        else:
            return self.search(word[1:], node.keys[word[0]])

    def startsWith(self, prefix: str, node=None) -> bool:
        if node is None:
            node = self.root
        if len(prefix) == 0:
            return True
        elif prefix[0] not in node.keys:
            return False
        else:
            return self.startsWith(prefix[1:], node.keys[prefix[0]])
In response to @basj:
The following code will capture \b (end of word) letters.
_end = '_end_'

def make_trie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter not in current_dict:        # Adjusted the
            return False                      # order of letter
        if _end in current_dict[letter]:      # checks to capture
            return True                       # the last letter.
        current_dict = current_dict[letter]
    # Ran out of letters before reaching the end of any stored word
    return False

t = make_trie(['hello', 'hi', 'foo', 'bar'])
>>> print(in_trie(t, 'hi'))
True
>>> print(in_trie(t, 'hola'))
False
>>> print(in_trie(t, 'hello friend'))
True
>>> print(in_trie(t, 'hel'))
False