
Fastest method of finding words in a collection with wildcard letters

I have ~200,000 words, and I need to find matches among them for patterns that can contain any number of single-letter wildcards. I also need to be able to look up words that contain no wildcards at all.

I've separated the words into collections by length:

static readonly HashSet<string>[] _validWords = { 
    new HashSet<string>(StringComparer.OrdinalIgnoreCase), // 3 letter words
    new HashSet<string>(StringComparer.OrdinalIgnoreCase), // 4 letter words
    new HashSet<string>(StringComparer.OrdinalIgnoreCase), // 5 letter words
    new HashSet<string>(StringComparer.OrdinalIgnoreCase), // 6 letter words
    new HashSet<string>(StringComparer.OrdinalIgnoreCase), // 7 letter words
    new HashSet<string>(StringComparer.OrdinalIgnoreCase), // 8 letter words
    new HashSet<string>(StringComparer.OrdinalIgnoreCase)  // 9 letter words
};

Searching for a specific word is simple:

public static bool IsValid(string word) {
    return word.Length >= GameplaySettings.Instance.MinWordLength && _validWords[word.Length - 3].Contains(word);
}

This is my current implementation for wildcard patterns, using a Regex. Eventually I'd like to get all matching words, but for now finding just one (or none) is fine:

public static bool IsValidRegex(string pattern, int length) {
    pattern = $"^{pattern}$";

    foreach (string word in _validWords[length - 3]) {
        if (Regex.Matches(word, pattern, RegexOptions.Singleline).Count > 0) { return true; }
    }

    return false;
}

There can be any number of wildcards (every letter can even be a wildcard), and it's currently not performing as well as I'd hoped.

So I'm wondering if there is a more efficient method!

Thanks for any help/suggestions!

Since your wildcards can only match one letter, the problem isn't too hard. If you needed to support variable-length substrings, I'd suggest reading some of the literature on how regular expressions work.
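To make the one-letter point concrete: a pattern matches a word exactly when both have the same length and every position is either the wildcard or the same letter. A minimal sketch of that check, without a Regex, might look like the following. The names `WildcardMatch` and `Matches` are hypothetical, and `_` as the wildcard character is an assumption (the question does not say which character it uses).

```csharp
using System;

public static class WildcardMatch
{
    // Assumed wildcard character; swap in whatever the patterns actually use.
    public const char Wildcard = '_';

    // True when pattern and word have the same length and every position of
    // pattern is either the wildcard or the same letter (ignoring case).
    public static bool Matches(ReadOnlySpan<char> pattern, ReadOnlySpan<char> word)
    {
        if (pattern.Length != word.Length)
            return false;

        for (int i = 0; i < pattern.Length; i++)
        {
            if (pattern[i] != Wildcard &&
                char.ToUpperInvariant(pattern[i]) != char.ToUpperInvariant(word[i]))
                return false;
        }
        return true;
    }
}
```

This still scans every word of the given length, so it only removes the Regex overhead; it does not reduce the number of words visited.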

This is a fairly basic second-year computer science "data structures and algorithms" exercise: build a trie (prefix tree) and walk it one letter at a time. Using a Dictionary in every Node probably isn't the fastest or most memory-efficient choice, but I would tackle the problem like this:

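using System;
using System.Collections.Generic;

// One trie node: endWord marks that a word ends here; next maps a letter to its child.
class Node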
{
    public bool endWord;
    public Dictionary<char, Node> next;
}

public class Words
{
    private Node root = new Node { endWord = false };
    public const char wildcard = '_';

    public void DefineWord(string word)
    {
        var node = root;
        foreach (var c in word)
        {
            if (node.next == null)
                node.next = new Dictionary<char, Node>();
            if (node.next.TryGetValue(c, out var nextNode))
            {
                node = nextNode;
            }
            else
            {
                node = node.next[c] = new Node { endWord = false };
            }
        }
        node.endWord = true;
    }

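    private bool IsValid(ReadOnlySpan<char> word, Node node)
    {
        // Pattern fully consumed: it matches only if a word ends at this node.
        // (Returning here also avoids indexing into an empty span below.)
        if (word.IsEmpty)
            return node.endWord;
        if (node.next == null)
            return false;

        if (word[0] == wildcard)
        {
            // A wildcard matches any single letter, so try every child.
            word = word.Slice(1);
            foreach (var n in node.next.Values)
            {
                if (IsValid(word, n))
                    return true;
            }
        }
        else if (node.next.TryGetValue(word[0], out var nextNode))
            return IsValid(word.Slice(1), nextNode);
        return false;
    }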

    public bool IsValid(string word)
        => IsValid(word, root);

    public static void Test1()
    {
        var words = new Words();
        words.DefineWord("APE");
        words.DefineWord("APPLE");
        words.DefineWord("BEAR");
        words.DefineWord("BEER");
        words.DefineWord("PEAR");
        words.DefineWord("PEER");
        words.DefineWord("PEERS");

        Assert.True(words.IsValid("APE"));
        Assert.True(words.IsValid("APPLE"));
        Assert.True(words.IsValid("PEAR"));
        Assert.True(words.IsValid("PEER"));
        Assert.True(words.IsValid("PEERS"));
        Assert.True(!words.IsValid("PLIERS"));
        Assert.True(words.IsValid("PE_R"));
        Assert.True(words.IsValid("_EAR"));
        Assert.True(words.IsValid("_E_R"));
    }
}
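The question also mentions eventually wanting all matching words rather than a yes/no answer. The same walk can collect them by recording which letter was taken at each step. The following is only a sketch under assumptions: `WordIndex`, `TrieNode`, and `FindAll` are hypothetical names, the class is written standalone so it does not depend on the code above, and it keeps the Dictionary-per-node layout even though that is not the most compact choice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical standalone variant of the trie that returns every match.
public class WordIndex
{
    private sealed class TrieNode
    {
        public bool EndWord;
        public readonly Dictionary<char, TrieNode> Next = new Dictionary<char, TrieNode>();
    }

    public const char Wildcard = '_'; // assumed wildcard character
    private readonly TrieNode _root = new TrieNode();

    public void DefineWord(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Next.TryGetValue(c, out var child))
                node.Next[c] = child = new TrieNode();
            node = child;
        }
        node.EndWord = true;
    }

    // Returns every stored word that matches the pattern (order is unspecified).
    public List<string> FindAll(string pattern)
    {
        var results = new List<string>();
        Collect(pattern, 0, _root, new char[pattern.Length], results);
        return results;
    }

    private static void Collect(string pattern, int index, TrieNode node,
                                char[] buffer, List<string> results)
    {
        if (index == pattern.Length)
        {
            // Whole pattern consumed: buffer now holds the letters actually taken.
            if (node.EndWord)
                results.Add(new string(buffer));
            return;
        }

        char c = pattern[index];
        if (c == Wildcard)
        {
            // A wildcard follows every child, remembering the concrete letter.
            foreach (var pair in node.Next)
            {
                buffer[index] = pair.Key;
                Collect(pattern, index + 1, pair.Value, buffer, results);
            }
        }
        else if (node.Next.TryGetValue(c, out var child))
        {
            buffer[index] = c;
            Collect(pattern, index + 1, child, buffer, results);
        }
    }
}
```

With the words from Test1 defined, `FindAll("_E_R")` would be expected to return BEAR, BEER, PEAR and PEER in some order. A pattern made entirely of wildcards visits every word of that length, so for that case the per-length HashSet from the question is already a ready-made answer.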
