
How to find all occurrences of a substring (with wildcards allowed) in a given String

I'm searching for an efficient way to do a wildcard-enabled search in Java. My first approach was of course to use regex. However, this approach does NOT find ALL possible matches!

Here's the code:

    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // StringOccurrence and normalizeWildcards are my own helpers (not shown here).
    public static ArrayList<StringOccurrence> matchesWildcard(String string, String pattern, boolean printToConsole) {
        Pattern p = Pattern.compile(normalizeWildcards(pattern));
        Matcher m = p.matcher(string);
        ArrayList<StringOccurrence> res = new ArrayList<>();
        int count = 0;
        while (m.find()) {
            res.add(new StringOccurrence(m.start(), m.end(), count, m.group()));
            if (printToConsole) {
                System.out.println(count + ") " + m.group() + ", " + m.start() + ", " + m.end());
            }
            count += 1;
        }
        return res;
    }

For a query q: ab*b and a String str: abbccabbccbbb I get the output:

    0) abb, 0, 3
    1) abb, 5, 8

But the whole String should also be a result, because it matches the pattern. It seems that the Java implementation of regex starts each new search after the last match...

Any ideas how this could work (or suggestions for frameworks...)?

As a regex, ab*b means "a" followed by zero or more "b" followed by a "b". The minimum match would be "ab". Sounds like you're looking for something like a[a-z]*b, where [a-z]* indicates zero or more of any lowercase letter. You may also want to bound it so that the start of the "word" must be an "a" and the end must be a "b": \\ba[a-z]*b\\b (written as a Java string literal).
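To make the difference concrete, here is a small sketch that runs the three patterns against the string from the question. The class name `WildcardVsRegex` and the helper `findAll` are made up for this illustration; only `java.util.regex` is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WildcardVsRegex {
    // Hypothetical helper: collects every match that Matcher.find() reports,
    // formatted as "<matched text>@<start index>".
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + "@" + m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "abbccabbccbbb";
        // '*' applies to the preceding 'b' only: prints [abb@0, abb@5]
        System.out.println(findAll("ab*b", s));
        // any run of lowercase letters between 'a' and 'b': prints [abbccabbccbbb@0]
        System.out.println(findAll("a[a-z]*b", s));
        // same, anchored at word boundaries: prints [abbccabbccbbb@0]
        System.out.println(findAll("\\ba[a-z]*b\\b", s));
    }
}
```

Note that the greedy `[a-z]*` reports only the single longest match, so this still does not enumerate every possible substring.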

You are expecting * to mean .* and .*? at the same time (and more).

You should reconsider what you really need. Let's extend your example:

abbccabbccbbbcabb

Do you really want all possibilities?

To achieve what you want you'll have to

iterate p1 over all occurrences of "ab"
    from p1+2 on
    iterate p2 over all occurrences of "b"
        output substring between p1 and p2+1

This is the corresponding Java code:

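// The class name is arbitrary; it only makes the snippet compile on its own.
public class AllMatches {
    public static void main( String[] args ){
        String s = "abbccabbccbbb";
        int f1 = 0;                 // where to resume the search for "ab"
        int p1;
        // outer loop: every occurrence of the prefix "ab"
        while( (p1 = s.indexOf( "ab", f1 )) >= 0 ){
            int f2 = p1 + 2;        // the closing 'b' must come after the prefix
            int p2;
            // inner loop: every 'b' following that prefix
            while( (p2 = s.indexOf( "b", f2 )) >= 0 ){
                // print the substring together with its start:end indices
                System.out.println( s.substring( p1, p2 + 1 ) + " " + p1 + ":" + (p2 + 1) );
                f2 = p2 + 1;
            }
            f1 = p1 + 2;
        }
    }
}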

Below is the output. You may be surprised: maybe that's more than you expect, but then you'll need to refine your specification.

abb 0:3
abbccab 0:7
abbccabb 0:8
abbccabbccb 0:11
abbccabbccbb 0:12
abbccabbccbbb 0:13
abb 5:8
abbccb 5:11
abbccbb 5:12
abbccbbb 5:13

Later

Why is a single regular expression not capable of doing it?

The basic mechanism of pattern matching is to try and match the regex against a string, starting at some position, initially 0. If a match is found, this position is advanced according to the matched string. The pattern matcher never looks back.

A pattern ab.*?b will try and find the next 'b' after an "ab". This means that no further match is possible that begins with the same "ab" and ends at some 'b' following that previously found "next 'b'".

In other words: one regex cannot find overlapping substrings.
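If every matching substring is really wanted for an arbitrary pattern, one option is to stop relying on `find()` and test each candidate substring with `Matcher.matches()` instead. The following is a minimal sketch under stated assumptions, not a tuned implementation: the names `AllSubstringMatches` and `allMatches` are illustrative, the wildcard `*` is assumed to have been translated to `.*` beforehand, and the number of candidates grows quadratically with the input length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllSubstringMatches {
    // Hypothetical helper: returns every non-empty substring of input that the
    // regex matches in full, formatted as "<text> <start>:<end>".
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                // Restrict the matcher to [start, end) and require a full match.
                m.region(start, end);
                if (m.matches()) {
                    out.add(input.substring(start, end) + " " + start + ":" + end);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed translation of the wildcard query ab*b into the regex ab.*b
        for (String hit : allMatches("ab.*b", "abbccabbccbbb")) {
            System.out.println(hit);
        }
    }
}
```

For the string from the question this yields the same ten substrings as the output listed above, at the price of trying every start/end pair.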

If you really need all possible matches, this answer is not useful for you (though maybe another user will find it useful).

If the widest match would be sufficient for you, then use a greedy quantifier (I guess you're using a reluctant one; showing your pattern would be useful).
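As a quick illustration of the difference, the sketch below compares a greedy and a reluctant quantifier on the string from the question. The class name `GreedyVsReluctant` and the helper `firstMatch` are invented for this example, and the two regexes are assumed translations of the wildcard query.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "abbccabbccbbb";
        // Greedy .* grabs as much as possible: prints abbccabbccbbb
        System.out.println(firstMatch("ab.*b", s));
        // Reluctant .*? stops at the first possible 'b': prints abb
        System.out.println(firstMatch("ab.*?b", s));
    }
}
```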

Google for greedy vs reluctant quantifiers for regex.

Cheers.

