Get words from trie containing specific letters
I'd like to retrieve words from a trie that contain specific letters. For example: list all words that contain the letters [a, g]. If my trie has the words ["APPLE", "EGG", "CAR", "BLUE", "AGRICULTURE", "DONE"], it would return "AGRICULTURE".
This is a very simple trie implementation:
    def load_trie(words):
        root = {}
        for word in words:
            curr_node = root
            for letter in word:
                curr_node = curr_node.setdefault(letter, {})
            # The empty-string key marks the end of a complete word.
            curr_node.setdefault('', True)
        return root

    with open('sowpods') as word_list:
        words = [word.strip().upper() for word in word_list]
    TRIE = load_trie(words)
If I can check for words containing specific letters, it would also be nice to look for words that don't contain specific letters.
@Mark offers a helpful remark about the dual problem:
For words that don't contain the letter, you can prune the branches with those keys.
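As a rough illustration of that pruning idea, here is a minimal sketch of a depth-first walk over the question's dict-of-dicts trie. The function name `find_words` and its `required` / `excluded` parameters are hypothetical, not part of the original post. Excluded letters are handled by never descending into their branches; required letters are simply tracked until the end-of-word marker is reached, so this walk does not prune anything for the primal problem.

```python
def load_trie(words):
    """Same dict-of-dicts trie as in the question; '' marks end of word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node.setdefault('', True)
    return root


def find_words(trie, required=(), excluded=()):
    """Yield words that contain every letter in `required` and none in `excluded`."""
    required = frozenset(required)
    excluded = frozenset(excluded)

    def walk(node, prefix, missing):
        for key, child in node.items():
            if key == '':
                # End-of-word marker: report the word only if no required letter is missing.
                if not missing:
                    yield prefix
            elif key not in excluded:
                # Branches under an excluded letter are never entered (pruned).
                yield from walk(child, prefix + key, missing - {key})

    yield from walk(trie, '', required)


trie = load_trie(["APPLE", "EGG", "CAR", "BLUE", "AGRICULTURE", "DONE"])
print(sorted(find_words(trie, required="AG")))  # ['AGRICULTURE']
print(sorted(find_words(trie, excluded="E")))   # ['CAR']
```

Both filters can be combined in one call, for example `find_words(trie, required="A", excluded="G")`.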
Now, how could we make a trie, or any tree, well adapted to the primal problem? Let's see.
A standard answer for the Anagram Problem is to store sorted letter sets:
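    from collections import defaultdict

    set_to_word = defaultdict(list)
    for word in vocabulary:
        # sorted() returns a list, which is unhashable, so join it into a string key
        set_to_word[''.join(sorted(word))].append(word)
and then
    set_to_word.get(''.join(sorted(target_word)))
will reveal all corresponding anagrams.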
To adapt this to a trie, we want frequent letters near the root. Here is one plausible frequency ordering of the alphabet:
ETAOINSRHLDCUMFPGWYBVKXJQZ
Rather than the sorted( ... ) permutation, order each word's letters by the etaoin ranking, and take advantage of the fact that a lookup will often hit near the root. In this scheme "tears" would map to "etasr".
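One way this scheme might look in code is sketched below. It is only an illustration under some assumptions: the names `etaoin_key`, `build_key_trie`, and `words_containing` are hypothetical, words are assumed to contain only the letters A to Z, each word is reduced to its distinct letters (a letter set), and keys are uppercased to match the question's trie. Because every path in the trie lists letters in increasing etaoin rank, a branch can be abandoned as soon as it steps past the next required letter.

```python
ETAOIN = "ETAOINSRHLDCUMFPGWYBVKXJQZ"
RANK = {letter: i for i, letter in enumerate(ETAOIN)}


def etaoin_key(word):
    """Distinct letters of `word`, ordered by the ETAOIN ranking."""
    return ''.join(sorted(set(word.upper()), key=RANK.__getitem__))


def build_key_trie(words):
    """Trie over etaoin keys; the '' entry holds the words sharing that letter set."""
    root = {}
    for word in words:
        node = root
        for letter in etaoin_key(word):
            node = node.setdefault(letter, {})
        node.setdefault('', []).append(word)
    return root


def words_containing(trie, letters):
    """Yield every stored word that contains all of `letters`."""
    needed = etaoin_key(letters)

    def walk(node, i):
        # `i` indexes the next required letter that has not been matched yet.
        for key, child in node.items():
            if key == '':
                if i == len(needed):
                    yield from child
            elif i == len(needed) or RANK[key] < RANK[needed[i]]:
                # Nothing left to match, or the required letter may still appear deeper.
                yield from walk(child, i)
            elif key == needed[i]:
                yield from walk(child, i + 1)
            # Otherwise `key` ranks after needed[i]; every deeper letter ranks
            # later still, so needed[i] cannot occur on this branch: prune it.

    yield from walk(trie, 0)


key_trie = build_key_trie(["APPLE", "EGG", "CAR", "BLUE", "AGRICULTURE", "DONE"])
print(etaoin_key("tears"))                       # ETASR
print(sorted(words_containing(key_trie, "AG")))  # ['AGRICULTURE']
```

Storing the word list at the end-of-key marker means anagrams and other words with the same letter set share a single path.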
For the dual problem, simply store the complement set of letters, obeying the same ordering. So "tears" maps to a 21-character string, or perhaps some prefix truncation would suffice.
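To make the complement idea concrete, here is a small sketch. The helper names `complement_key` and `avoids` are hypothetical, and the ordering is assumed to be the 26-letter string above written without spaces. A word avoids a set of letters exactly when all of those letters appear in its complement key, so the "does not contain" query turns into a "contains" query over complement keys.

```python
ETAOIN = "ETAOINSRHLDCUMFPGWYBVKXJQZ"


def complement_key(word):
    """Letters that do NOT occur in `word`, in ETAOIN order."""
    present = set(word.upper())
    return ''.join(letter for letter in ETAOIN if letter not in present)


def avoids(word, letters):
    """True if `word` contains none of `letters`, checked via its complement key."""
    return set(letters.upper()) <= set(complement_key(word))


key = complement_key("tears")
print(key, len(key))          # OINHLDCUMFPGWYBVKXJQZ 21
print(avoids("tears", "oz"))  # True
print(avoids("tears", "ez"))  # False
```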