
Get words from trie containing specific letters

I'd like to retrieve words from a trie that contain specific letters. For example: list all words that contain the letters [a, g]. If my trie has the words ["APPLE", "EGG", "CAR", "BLUE", "AGRICULTURE", "DONE"], it would return "AGRICULTURE".

This is a very simple trie implementation:

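def load_trie(words):
    # The trie is a nested dict: each key is a letter, each value a child node.
    root = {}
    for word in words:
        curr_node = root
        for letter in word:
            # Reuse the child for this letter, creating it on first sight.
            curr_node = curr_node.setdefault(letter, {})
        # The empty-string key marks the end of a complete word.
        curr_node.setdefault('', True)
    return root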

with open('sowpods') as word_list:
    words = [word.strip().upper() for word in word_list]
    
TRIE = load_trie(words)

If I can check for words containing specific letters, it would also be nice to look for words that don't contain specific letters.

@Mark offers a helpful remark about the dual problem:

For words that don't contain the letter you can prune the branches with those keys.
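As a rough illustration of that pruning idea on the question's nested-dict trie, here is a minimal sketch. The function name `find_words` and its `required`/`excluded` parameters are hypothetical, and it assumes uppercase words with `''` as the end-of-word marker, as in `load_trie`. It also handles the "must contain" case, but only by brute force: it walks every branch that is not pruned.

```python
def load_trie(words):
    # Same nested-dict layout as the question: '' marks the end of a word.
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[''] = True
    return root


def find_words(node, required, excluded, prefix=''):
    """Yield words that use every letter in `required` and none in `excluded`."""
    for key, child in node.items():
        if key == '':
            # End-of-word marker: keep the word only if no required letter is missing.
            if not required:
                yield prefix
        elif key not in excluded:
            # Branches keyed by an excluded letter are never entered (pruned).
            yield from find_words(child, required - {key}, excluded, prefix + key)


trie = load_trie(["APPLE", "EGG", "CAR", "BLUE", "AGRICULTURE", "DONE"])
containing = sorted(find_words(trie, frozenset("AG"), frozenset()))
avoiding = sorted(find_words(trie, frozenset(), frozenset("AU")))
```

Here `containing` holds the words that use both A and G, and `avoiding` holds the words that use neither A nor U. Exclusion cuts the search early, while the required letters give no pruning at all in a plain trie, which is what motivates the reordering discussed next.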

Now, how could we make a trie, or any tree, well adapted to the primal problem? Let's see. A standard answer for the Anagram Problem is to store sorted letter sets:

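from collections import defaultdict
set_to_word = defaultdict(list)
for word in vocabulary:
    set_to_word[''.join(sorted(word))].append(word)  # a list is unhashable, so join into a str key

and then

set_to_word.get(''.join(sorted(target_word)))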

will reveal all corresponding anagrams.

To adapt this to a trie, we want frequent letters near the root. Here is one plausible frequency ordering of the alphabet:

ETAOINSRHLDCUMFPGWYBVKXJQZ

Rather than the sorted( ... ) permutation, permute each word's letters by the etaoin ordering, and take advantage of the fact that a query's letters will often be matched near the root. In this scheme "tears" would map to "etasr".
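One way this could look in code, as a sketch rather than a definitive implementation: the names `etaoin_key`, `load_key_trie` and `words_containing` are hypothetical, each word is reduced to its set of distinct letters, and only the letters A-Z are assumed to occur. Because every key is ordered by frequency rank, a branch can be skipped as soon as its letter ranks after the next letter still being looked for.

```python
ETAOIN = "ETAOINSRHLDCUMFPGWYBVKXJQZ"
RANK = {letter: i for i, letter in enumerate(ETAOIN)}


def etaoin_key(word):
    # Distinct letters of the word, most frequent first.
    return ''.join(sorted(set(word.upper()), key=RANK.__getitem__))


def load_key_trie(words):
    # Trie over letter-set keys; the '' entry lists every word sharing that key.
    root = {}
    for word in words:
        node = root
        for letter in etaoin_key(word):
            node = node.setdefault(letter, {})
        node.setdefault('', []).append(word)
    return root


def words_containing(node, needed):
    """Yield words whose letter set includes all of `needed` (an etaoin-ordered string)."""
    for key, child in node.items():
        if key == '':
            if not needed:
                yield from child
        elif not needed:
            # Everything has been matched; all words below this node qualify.
            yield from words_containing(child, needed)
        elif key == needed[0]:
            yield from words_containing(child, needed[1:])
        elif RANK[key] < RANK[needed[0]]:
            # This letter ranks before the next needed one, which may still appear deeper.
            yield from words_containing(child, needed)
        # Otherwise the key ranks after needed[0]; keys are ordered, so
        # needed[0] cannot occur below this child and the branch is skipped.


key_trie = load_key_trie(["APPLE", "EGG", "CAR", "BLUE", "AGRICULTURE", "DONE"])
matches = list(words_containing(key_trie, etaoin_key("ag")))
```

With the question's word list, `matches` contains only "AGRICULTURE". Several branches under E (those starting with G, L or O) are discarded at depth two because they already rank past A.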

For the dual problem, simply store the complement set of letters, obeying the same ordering. So "tears" maps to a 21-character string, or perhaps some prefix truncation would suffice.
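The complement key can be sketched as follows. The helper name `complement_key` is hypothetical, and the 26-letter ordering is assumed to be the one quoted above.

```python
ETAOIN = "ETAOINSRHLDCUMFPGWYBVKXJQZ"


def complement_key(word):
    # Letters the word does NOT use, kept in etaoin order.
    used = set(word.upper())
    return ''.join(letter for letter in ETAOIN if letter not in used)


key = complement_key("tears")
```

For "tears" the key has 26 - 5 = 21 characters. A word avoids a given set of letters exactly when its complement key contains all of them, so a "contains these letters" search over complement keys answers the "does not contain" query.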
