How to find all occurrences of a substring (with wildcards allowed) in a given String

I'm searching for an efficient way for a wildcard-enabled search in Java. My first approach was of course to use regex. However this approach does NOT find ALL possible matches!

Here's the code:

    // Needs java.util.ArrayList, java.util.regex.Matcher and java.util.regex.Pattern.
    // StringOccurrence and normalizeWildcards(...) are my own helpers (not shown here).
    public static ArrayList<StringOccurrence> matchesWildcard(String string, String pattern, boolean printToConsole) {
        Pattern p = Pattern.compile(normalizeWildcards(pattern));
        Matcher m = p.matcher(string);
        ArrayList<StringOccurrence> res = new ArrayList<StringOccurrence>();
        int count = 0;
        // find() resumes after the end of the previous match
        while (m.find()) {
            res.add(new StringOccurrence(m.start(), m.end(), count, m.group()));
            if (printToConsole)
                System.out.println(count + ") " + m.group() + ", " + m.start() + ", " + m.end());
            count += 1;
        }
        return res;
    }

For a query q: ab*b and a String str: abbccabbccbbb I get the output:

    0) abb, 0, 3
    1) abb, 5, 8

But the whole String should also be a result, because it matches the pattern. It seems that the Java implementation of regex starts each new search after the end of the last match...

Any ideas how this could work (or suggestions for frameworks...)?

As a regex, ab*b means "a" followed by zero or more "b" followed by a "b". The minimum match would be "ab". Sounds like you're looking for something like: a[a-z]*b where [a-z]* indicates zero or more of any lowercase letter. You may also want to bound it so that the start of the "word" must be an "a" and the end must be a "b": \\ba[a-z]*b\\b (written as a Java string literal, hence the doubled backslashes).
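To make the difference concrete, here is a small sketch that runs both patterns over a made-up sample string. The class name `WordBoundaryDemo`, the helper `groups` and the sample input are only illustrative and are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Returns the text of every match that Matcher.find() reports, in order.
    static List<String> groups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String words = "ab abb acb xab"; // illustrative input

        // Literal regex reading of ab*b: an 'a', any number of 'b', then a 'b'.
        System.out.println(groups("ab*b", words));            // [ab, abb, ab]

        // Whole words that start with 'a' and end with 'b'.
        System.out.println(groups("\\ba[a-z]*b\\b", words));  // [ab, abb, acb]
    }
}
```

Note that "xab" contributes a match for the first pattern (its trailing "ab") but not for the second, because there is no word boundary between 'x' and 'a'.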

You are expecting * to mean .* and .*? at the same time (and more).
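A quick way to see the two readings side by side is to run the greedy and the reluctant variant over the string from the question. This is just a sketch: it assumes the wildcard `*` is translated to either `.*` or `.*?`, and the class name `QuantifierDemo` and helper `findAll` are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects "text start:end" for every match that Matcher.find() reports.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + " " + m.start() + ":" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "abbccabbccbbb";

        // Greedy: .* grabs as much as possible, so only the widest match is reported.
        System.out.println(findAll("ab.*b", s));   // [abbccabbccbbb 0:13]

        // Reluctant: .*? grabs as little as possible, so only the shortest matches are reported.
        System.out.println(findAll("ab.*?b", s));  // [abb 0:3, abb 5:8]
    }
}
```

Neither variant reports the matches in between, such as "abbccab".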

You should reconsider what you really need. Let's extend your example:

abbccabbccbbbcabb

Do you really want all possibilities?

To achieve what you want you'll have to

iterate p1 over all occurrences of "ab"
    from p1+2 on
    iterate p2 over all occurrences of "b"
        output substring between p1 and p2+1

This is the corresponding Java code:

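public static void main( String[] args ){
    String s = "abbccabbccbbb";
    int f1 = 0;   // where to resume searching for "ab"
    int p1;       // start of the current "ab"
    while( (p1 = s.indexOf( "ab", f1 )) >= 0 ){
        int f2 = p1 + 2;   // first position after "ab"
        int p2;            // position of the closing 'b'
        while( (p2 = s.indexOf( "b", f2 )) >= 0 ){
            // print the match followed by its start:end offsets
            System.out.println( s.substring( p1, p2 + 1 ) + " " + p1 + ":" + (p2 + 1) );
            f2 = p2 + 1;
        }
        f1 = p1 + 2;
    }
}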

Below is the output. You may be surprised - maybe that's more than you expect, but then you'll need to refine your specification.

abb 0:3
abbccab 0:7
abbccabb 0:8
abbccabbccb 0:11
abbccabbccbb 0:12
abbccabbccbbb 0:13
abb 5:8
abbccb 5:11
abbccbb 5:12
abbccbbb 5:13

Later

Why is a single regular expression not capable of doing it?

The basic mechanism of pattern matching is to try and match the regex against a string, starting at some position, initially 0. If a match is found, this position is advanced to the end of the matched string and the next search starts from there. The pattern matcher never looks back.

A pattern ab.*?b will try and find the next 'b' after an "ab". This means that no match is possible beginning with the same "ab" and ending at some 'b' following that previously found "next 'b'".

In other words: a single find() pass with one regex cannot report overlapping substrings.
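If you would rather keep a regex than hand-code the indexOf loops, one possible workaround (a sketch, not something from the question) is to test every start/end pair yourself with `Matcher.region(...)` and `Matcher.matches()`. The class name `AllMatchesDemo` and the method `allMatches` are invented for this example, and the approach performs a quadratic number of match attempts, so it is only reasonable for short inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesDemo {
    // Returns "text start:end" for every non-empty substring of input
    // that the regex matches in full, including overlapping ones.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                // Restrict the matcher to [start, end) and require a full match of that region.
                m.region(start, end);
                if (m.matches()) {
                    out.add(input.substring(start, end) + " " + start + ":" + end);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints ten lines for the string from the question, from "abb 0:3" to "abbccbbb 5:13".
        for (String match : allMatches("ab.*b", "abbccabbccbbb")) {
            System.out.println(match);
        }
    }
}
```

Here the overlap handling lives in the two outer loops, not in the regex itself; the pattern only decides whether one particular substring qualifies.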

If you really need all possible matches, this answer will not help you (though other users may find it useful).

If the widest match would be sufficient for you, then use a greedy quantifier (I guess you're using a reluctant one; showing your pattern would be useful).

Google for greedy vs reluctant quantifiers for regex.

Cheers.
