
Quickest way to return list of Strings by using wildcard from collection in Java

I have a set of 100,000 Strings. For example, I want to get all the strings starting with "JO" from that set. What would be the best solution for that?

I was thinking of Aho-Corasick, but the implementation I have does not support wildcards.

If you want all the strings starting with a given sequence, you can add all the Strings to a NavigableSet such as a TreeSet; subSet(text, text + '\uFFFF') will then give you all the entries starting with text. Locating the range is O(log n); iterating over it takes time proportional to the number of matches.
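The prefix lookup described above could be sketched as follows. The class and method names (PrefixSearch, startingWith) are made up for illustration, and the sketch assumes that no stored string contains the char '\uFFFF' directly after the prefix.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixSearch {
    // Returns a live view of every entry that starts with the given prefix.
    static SortedSet<String> startingWith(NavigableSet<String> sorted, String prefix) {
        // Strings compare char by char, so everything that begins with the prefix
        // sorts at or after the prefix itself and before prefix + '\uFFFF'.
        // Assumption: no stored string has '\uFFFF' right after the prefix.
        return sorted.subSet(prefix, prefix + '\uFFFF');
    }

    public static void main(String[] args) {
        NavigableSet<String> names = new TreeSet<>(
                Arrays.asList("JOHN", "JOE", "JAMES", "JO", "KATE", "JP"));
        System.out.println(startingWith(names, "JO")); // prints [JO, JOE, JOHN]
    }
}
```

Because subSet returns a view rather than a copy, the cost of the call itself does not depend on the number of matches; only iterating the result does.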


If you want all the Strings that end with a given sequence, you can do a similar thing, except that you have to reverse each String. In this case a TreeMap from reversed String to forward String would be a better structure.
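A minimal sketch of the reversed-key idea, assuming plain text without surrogate pairs. SuffixSearch and endingWith are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class SuffixSearch {
    // Maps each reversed string to its original (forward) form.
    private final NavigableMap<String, String> byReversed = new TreeMap<>();

    SuffixSearch(Collection<String> words) {
        for (String w : words) {
            byReversed.put(reverse(w), w);
        }
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // A suffix of the original is a prefix of the reversed string,
    // so an "ends with" query becomes a prefix range query on the reversed keys.
    List<String> endingWith(String suffix) {
        String from = reverse(suffix);
        return new ArrayList<>(byReversed.subMap(from, from + '\uFFFF').values());
    }

    public static void main(String[] args) {
        SuffixSearch index = new SuffixSearch(
                Arrays.asList("report.pdf", "notes.txt", "slides.pdf", "pdf"));
        System.out.println(index.endingWith(".pdf"));
    }
}
```

The results come back in the order of the reversed keys, not in the natural order of the original strings.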

If you want "x*z", you can search the first set for the prefix x, search the Map for the suffix z, and take the intersection of the two results, since a match has to both start with x and end with z.
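A string matches "x*z" when it starts with x, ends with z, and is long enough that the two parts do not overlap. One possible way to combine the two sorted structures is sketched below; PrefixSuffixSearch and matching are hypothetical names, and the same '\uFFFF' assumption applies as for a plain prefix search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class PrefixSuffixSearch {
    private final NavigableSet<String> forward = new TreeSet<>();
    private final NavigableMap<String, String> byReversed = new TreeMap<>();

    PrefixSuffixSearch(Collection<String> words) {
        for (String w : words) {
            forward.add(w);
            byReversed.put(new StringBuilder(w).reverse().toString(), w);
        }
    }

    // Returns every stored string of the form prefix + anything + suffix.
    List<String> matching(String prefix, String suffix) {
        String rev = new StringBuilder(suffix).reverse().toString();
        Collection<String> endsWith = byReversed.subMap(rev, rev + '\uFFFF').values();
        SortedSet<String> startsWith = forward.subSet(prefix, prefix + '\uFFFF');
        List<String> result = new ArrayList<>();
        for (String candidate : endsWith) {
            // Keep only candidates present in both ranges; the length check rejects
            // strings where prefix and suffix would overlap (e.g. "abc" for "ab*bc").
            if (candidate.length() >= prefix.length() + suffix.length()
                    && startsWith.contains(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixSuffixSearch index = new PrefixSuffixSearch(
                Arrays.asList("foobar", "foo123bar", "foobaz", "barfoo", "foo"));
        System.out.println(index.matching("foo", "bar"));
    }
}
```

This sketch always iterates the suffix matches and probes the prefix range; iterating whichever range is smaller would likely be cheaper on large data.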

If you want contains, i.e. "*x*", you can use a NavigableMap<String, Set<String>> where the keys are the substrings of each String starting from the first, second, third char, etc. (that is, all of its suffixes). The value is a Set because several Strings can share the same suffix. You can then search it just like the starts-with structure.
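The suffix-keyed map for "contains" queries might look like the sketch below, which assumes Java 8 or later. SubstringIndex and containing are hypothetical names. Note that storing every suffix of every string can use a lot of memory for long strings.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SubstringIndex {
    // Key: a suffix of some stored string. Value: all stored strings having that suffix.
    private final NavigableMap<String, Set<String>> bySuffix = new TreeMap<>();

    SubstringIndex(Collection<String> words) {
        for (String w : words) {
            for (int i = 0; i < w.length(); i++) {
                bySuffix.computeIfAbsent(w.substring(i), k -> new TreeSet<>()).add(w);
            }
        }
    }

    // A string contains text exactly when one of its suffixes starts with text,
    // so this is a prefix range query over the suffix keys.
    Set<String> containing(String text) {
        Set<String> result = new TreeSet<>();
        for (Set<String> owners : bySuffix.subMap(text, text + '\uFFFF').values()) {
            result.addAll(owners);
        }
        return result;
    }

    public static void main(String[] args) {
        SubstringIndex index = new SubstringIndex(
                Arrays.asList("JOHN", "MAJOR", "BOB", "JOJO"));
        System.out.println(index.containing("JO")); // prints [JOHN, JOJO, MAJOR]
    }
}
```

Collecting the hits into a Set removes the duplicates that arise when a string contains the text more than once, as "JOJO" does for "JO".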

Here's a custom matcher class that does the matching without regular expressions (to be precise, it only uses a regex in the constructor) and supports wildcard matching:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WildCardMatcher {
    private final List<String> patternParts;
    private final boolean openStart;
    private final boolean openEnd;

    public WildCardMatcher(final String pattern) {
        // Split on '*' and drop the empty Strings left by leading or repeated wildcards.
        final List<String> tmpList = new ArrayList<String>(
                                     Arrays.asList(pattern.split("\\*")));
        while (tmpList.remove("")) { /* remove empty Strings */ }
        // the two statements above can be made a lot simpler using a Guava Splitter
        if (tmpList.isEmpty())
            throw new IllegalArgumentException("Invalid pattern");
        patternParts = tmpList;
        openStart = pattern.startsWith("*");
        openEnd = pattern.endsWith("*");
    }

    public boolean matches(final String item) {
        final int last = patternParts.size() - 1;
        int nextIndex = 0;
        for (int i = 0; i <= last; i++) {
            final String part = patternParts.get(i);
            if (i == last && !openEnd) {
                // Without a trailing '*', the final part must sit at the very end
                // of the item and must not overlap the parts already matched.
                final int start = item.length() - part.length();
                return start >= nextIndex && item.endsWith(part)
                        && (i > 0 || openStart || start == 0);
            }
            final int index = item.indexOf(part, nextIndex);
            // Without a leading '*', the first part must sit at index 0.
            if (index < 0 || (i == 0 && index > 0 && !openStart))
                return false;
            nextIndex = index + part.length();
        }
        return true;
    }

}

Here's some test code:

public static void main(final String[] args) throws Exception {
    testMatch("foo*bar", "foobar", "foo123bar", "foo*bar", "foobarandsomethingelse");
    testMatch("*.*", "somefile.doc", "somefile", ".doc", "somefile.");
    testMatch("pe*", "peter", "antipeter");
}

private static void testMatch(final String pattern, final String... words) {
    final WildCardMatcher matcher = new WildCardMatcher(pattern);
    for (final String word : words) {
        System.out.println("Pattern " + pattern + " matches word '"
                          + word + "': " + matcher.matches(word));
    }
}

Output:

Pattern foo*bar matches word 'foobar': true
Pattern foo*bar matches word 'foo123bar': true
Pattern foo*bar matches word 'foo*bar': true
Pattern foo*bar matches word 'foobarandsomethingelse': false
Pattern *.* matches word 'somefile.doc': true
Pattern *.* matches word 'somefile': false
Pattern *.* matches word '.doc': true
Pattern *.* matches word 'somefile.': true
Pattern pe* matches word 'peter': true
Pattern pe* matches word 'antipeter': false

While this is far from being production-ready, it should be fast enough and it supports multiple wildcards (including in the first and last place). But of course if your wildcards are only at the end, use Peter's answer (+1).
