Fastest way to return a list of Strings using a wildcard from a collection in Java

Question

I have a set of 100,000 Strings. For example, I want to get all the strings starting with "JO" from that set. What would be the best solution for that?

I was thinking of Aho-Corasick, but the implementation I have does not support wildcards.

Answer 1

If you want all the strings starting with a sequence, you can add all the Strings to a NavigableSet such as a TreeSet; subSet(text, text+'\uFFFF') will then give you all the entries starting with text. Locating the range is O(log n); iterating over the matches then costs time proportional to their number.
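A minimal sketch of that prefix lookup, assuming the set uses the natural (case-sensitive) String ordering and that no stored string contains the character '\uFFFF'. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixLookup {
    /** Returns a view of every entry in the set that starts with prefix. */
    public static SortedSet<String> startingWith(NavigableSet<String> set, String prefix) {
        // '\uFFFF' is the largest char value, so prefix + '\uFFFF' sorts after
        // every string that starts with prefix (assuming none contains '\uFFFF').
        return set.subSet(prefix, true, prefix + '\uFFFF', false);
    }

    public static void main(String[] args) {
        NavigableSet<String> names = new TreeSet<>(
                Arrays.asList("JOHN", "JO", "JOSEPH", "JAMES", "JP", "MARY"));
        System.out.println(startingWith(names, "JO")); // prints [JO, JOHN, JOSEPH]
    }
}
```

The returned set is a view backed by the TreeSet, so nothing is copied until you iterate over it.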

If you want all the Strings ending with a sequence, you can do a similar thing, except that you have to reverse each String. In this case, a TreeMap from reversed String to forward String would be a better structure.
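The reversed-key idea could look roughly like this. It is only a sketch under the same '\uFFFF' assumption as the prefix lookup, and SuffixLookup and endingWith are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class SuffixLookup {
    // key: reversed String, value: the original (forward) String
    private final NavigableMap<String, String> byReversed = new TreeMap<>();

    public SuffixLookup(Iterable<String> strings) {
        for (String s : strings) {
            byReversed.put(reverse(s), s);
        }
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Returns the original Strings that end with suffix. */
    public List<String> endingWith(String suffix) {
        // A suffix of the forward String is a prefix of the reversed String.
        String from = reverse(suffix);
        return new ArrayList<>(
                byReversed.subMap(from, true, from + '\uFFFF', false).values());
    }

    public static void main(String[] args) {
        SuffixLookup lookup = new SuffixLookup(
                Arrays.asList("JOHN", "MARION", "JASON", "ANNA"));
        System.out.println(lookup.endingWith("ON")); // prints [MARION, JASON]
    }
}
```

The results come back ordered by reversed key, not alphabetically; sort them afterwards if the order matters.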

If you want "x*z", you can search the first set for the prefix "x", search the Map for the suffix "z", and take the intersection of the set's results with the Map's values.
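Here is one way the combined lookup might be written, assuming Java 8 or later and no '\uFFFF' in the data. The class name and the extra length check (which rejects Strings where prefix and suffix would overlap) are my own additions, not part of the approach described above.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class PrefixSuffixLookup {
    private final NavigableSet<String> forward = new TreeSet<>();
    private final NavigableMap<String, String> byReversed = new TreeMap<>();

    public PrefixSuffixLookup(Collection<String> strings) {
        for (String s : strings) {
            forward.add(s);
            byReversed.put(new StringBuilder(s).reverse().toString(), s);
        }
    }

    /** Returns the Strings matching prefix + "*" + suffix. */
    public Set<String> matching(String prefix, String suffix) {
        String rev = new StringBuilder(suffix).reverse().toString();
        // Candidates that start with prefix (copied, so the index is not modified).
        Set<String> result = new TreeSet<>(
                forward.subSet(prefix, true, prefix + '\uFFFF', false));
        // Keep only those that also end with suffix.
        result.retainAll(new HashSet<>(
                byReversed.subMap(rev, true, rev + '\uFFFF', false).values()));
        // Prefix and suffix must not overlap inside the matched String.
        int minLength = prefix.length() + suffix.length();
        result.removeIf(s -> s.length() < minLength);
        return result;
    }

    public static void main(String[] args) {
        PrefixSuffixLookup lookup = new PrefixSuffixLookup(
                Arrays.asList("JOHN", "JOAN", "JASON", "JO", "JON", "N"));
        System.out.println(lookup.matching("JO", "N")); // prints [JOAN, JOHN, JON]
    }
}
```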

If you want contains ("*x*"), you can use a NavigableMap<String, Set<String>> where the keys are the substrings of each String starting from the first, second, third char, etc. (that is, every suffix). The value is a Set because several Strings can share the same suffix. You can then search it the same way as the starts-with structure.
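A rough sketch of such a suffix-keyed index, assuming Java 8 or later and no '\uFFFF' in the data; SubstringIndex and containing are illustrative names. Note that storing every suffix of every String can use a lot of memory when the Strings are long.

```java
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SubstringIndex {
    // key: a suffix, value: every original String that has that suffix
    private final NavigableMap<String, Set<String>> bySuffix = new TreeMap<>();

    public SubstringIndex(Iterable<String> strings) {
        for (String s : strings) {
            for (int i = 0; i < s.length(); i++) {
                bySuffix.computeIfAbsent(s.substring(i), k -> new TreeSet<>()).add(s);
            }
        }
    }

    /** Returns the Strings that contain text anywhere. */
    public Set<String> containing(String text) {
        // A String contains text exactly when one of its suffixes starts with text.
        Set<String> result = new TreeSet<>();
        for (Set<String> owners
                : bySuffix.subMap(text, true, text + '\uFFFF', false).values()) {
            result.addAll(owners);
        }
        return result;
    }

    public static void main(String[] args) {
        SubstringIndex index = new SubstringIndex(
                Arrays.asList("JOHN", "MAJOR", "ANNA"));
        System.out.println(index.containing("JO")); // prints [JOHN, MAJOR]
    }
}
```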

Answer 2

Here's a custom matcher class that does the matching without regular expressions (more precisely, it only uses a regex in the constructor) and supports wildcard matching:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WildCardMatcher {
    private final List<String> patternParts;
    private final boolean openStart;
    private final boolean openEnd;

    public WildCardMatcher(final String pattern) {
        // split() takes a regex, so the asterisk has to be escaped
        final List<String> tmpList = new ArrayList<String>(
                                     Arrays.asList(pattern.split("\\*")));
        while (tmpList.remove("")) { /* remove empty Strings */ }
        // these last two lines can be made a lot simpler using a Guava Splitter
        if (tmpList.isEmpty())
            throw new IllegalArgumentException("Invalid pattern");
        patternParts = tmpList;
        openStart = pattern.startsWith("*");
        openEnd = pattern.endsWith("*");
    }

    public boolean matches(final String item) {
        final int last = patternParts.size() - 1;
        int nextIndex = 0;
        for (int i = 0; i <= last; i++) {
            final String part = patternParts.get(i);
            final int index;
            if (i == last && !openEnd) {
                // the final literal must sit at the very end of the item
                // and must not overlap the parts matched so far
                index = item.length() - part.length();
                if (index < nextIndex || !item.endsWith(part))
                    return false;
            } else {
                // any other literal: take its first occurrence after the previous part
                index = item.indexOf(part, nextIndex);
                if (index < 0)
                    return false;
            }
            // without a leading wildcard the first literal must start the item
            if (i == 0 && !openStart && index != 0)
                return false;
            nextIndex = index + part.length();
        }
        return true;
    }

}

Here's some test code:

public static void main(final String[] args) throws Exception {
    testMatch("foo*bar", "foobar", "foo123bar", "foo*bar", "foobarandsomethingelse");
    testMatch("*.*", "somefile.doc", "somefile", ".doc", "somefile.");
    testMatch("pe*", "peter", "antipeter");
}

private static void testMatch(final String pattern, final String... words) {
    final WildCardMatcher matcher = new WildCardMatcher(pattern);
    for (final String word : words) {
        System.out.println("Pattern " + pattern + " matches word '"
                          + word + "': " + matcher.matches(word));
    }
}

Output:

Pattern foo*bar matches word 'foobar': true
Pattern foo*bar matches word 'foo123bar': true
Pattern foo*bar matches word 'foo*bar': true
Pattern foo*bar matches word 'foobarandsomethingelse': false
Pattern *.* matches word 'somefile.doc': true
Pattern *.* matches word 'somefile': false
Pattern *.* matches word '.doc': true
Pattern *.* matches word 'somefile.': true
Pattern pe* matches word 'peter': true
Pattern pe* matches word 'antipeter': false

While this is far from production-ready, it should be fast enough, and it supports multiple wildcards (including in the first and last position). But of course, if your wildcards are only at the end, use Peter's answer (+1).

Answer 1: score 10, accepted, 2011-05-10 15:41:24

Answer 2: score 2, 2011-05-10 16:08:33
