简体   繁体   中英

Java: How to implement wildcard matching?

I'm researching on how to find k values in the BST that are closest to the target, and came across the following implementation with the rules:

'?' Matches any single character.

'*' Matches any sequence of characters (including the empty sequence).

The matching should cover the entire input string (not partial).

The function prototype should be: bool isMatch(const char *s, const char *p)

Some examples:

isMatch("aa","a") → false

isMatch("aa","aa") → true

isMatch("aaa","aa") → false

isMatch("aa", "*") → true

isMatch("aa", "a*") → true

isMatch("ab", "?*") → true

isMatch("aab", "c a b") → false

Code:

import java.util.*;

public class WildcardMatching {
    boolean isMatch(String s, String p) {
        int i=0, j=0;
        int ii=-1, jj=-1;

        while(i<s.length()) {
            if(j<p.length() && p.charAt(j)=='*') {
                ii=i;
                jj=j;
                j++;
            } else if(j<p.length() && 
                      (s.charAt(i) == p.charAt(j) ||
                       p.charAt(j) == '?')) {
                i++;
                j++;
            } else {
                if(jj==-1) return false;

                j=jj;
                i=ii+1;
            }
        }

        while(j<p.length() && p.charAt(j)=='*') j++;

        return j==p.length();
    }

    public static void main(String args[]) {
        String s = "aab";
        String p = "a*";

        WildcardMatching wcm = new WildcardMatching();
        System.out.println(wcm.isMatch(s, p));
    }
}

And my question is, what's the reason for having two additional indexes, ii and jj , and why do they get initialized with -1 ? What's the purpose of each? Wouldn't traversing it with i and j be enough?

And what's the purpose of ii=i; and jj=j; in the first if case, and i=ii+1; and j=jj; in the third if case?

Lastly, in what case would you encounter while(j<p.length() && p.charAt(j)=='*') j++; ?

Examples would be extremely helpful in understanding. Thank you in advance and will accept answer/up vote.

It looks like ii and jj are used to handle the wildcard "*", which matches to any sequence. Their initialization to -1 acts as a flag: it tells us if we've hit an unmatched sequence and are not currently evaluating a "*". We can walk through your examples one at a time.

Notice that i is related to the parameter s (the original string) and j is related to the parameter p (the pattern).

isMatch("aa","a") : this returns false because the j<p.length() statement will fail before we leave the while loop, since the length of p ("a") is only 1 whereas the length of s ("aa") is 2, so we'll jump to the else block. This is where the -1 initialization comes in: since we never saw any wildcards in p , jj is still -1, indicating that there's no way the strings can match, so we return false.

isMatch("aa","aa") : s and p are exactly the same, so the program repeatedly evaluates the else-if block with no problems and finally breaks out of the while loop once i equals 2 (the length of "aa"). The second while loop never runs, since j is not less than p.length() - in fact, since the else-if increments i and j together, they are both equal to 2, and 2 is not less than the length of "aa". We return j == p.length() , which evaluates to 2 == 2 , and get true .

isMatch("aaa","aa") : this one fails for the same reason as the first. Namely, the strings are not the same length and we never hit a wildcard character.

isMatch("aa","*") : this is where it gets interesting. First we'll enter the if block, since we've seen a "*" in p . We set ii and jj to 0 and increment j only. On the second iteration, j<p.length() fails, so we jump to the else block. jj is not -1 anymore (it's 0), so we reset j to 0 and set i to 0+1. This basically allows us to keep evaluating the wildcard, since j just gets reset to jj , which holds the position of the wildcard, and ii tells us where to start from in our original string. This test case also explains the second while loop. In some cases our pattern may be much shorter than the original string, so we need to make sure it's matched up with wildcards. For example, isMatch("aaaaaa","a**") should return true, but the final return statement is checking to see if j == p.length() , asking if we checked the entire pattern. Normally we would stop at the first wildcard, since it matches anything, so we need to finally run through the rest of the pattern and make sure it only contains wildcards.

From here you can figure out the logic behind the other test cases. I hope this helped!

Lets look at this a bit out of order.

First, this is a parallel iteration of the string ( s ) and the wildcard pattern ( p ), using variable i to index s and variable j to index p .

The while loop will stop iterating when end of s is reached. When that happens, hopefully end of p has been reached too, in while case it'll return true ( j==p.length() ).

If however p ends with a * , that is also valid (eg isMatch("ab", "ab*") ), and that's what the while(j<p.length() && p.charAt(j)=='*') j++; loop ensures, ie any * in the pattern at this point is skipped, and if that reaches end of p , then it returns true . If end of p is not reached, it returns false.

That was the answer to your last question. Now lets look at the loop. The else if will iterate both i and j as long as there is a match, eg 'a' == 'a' or 'a' == '?' .

When a * wildcard is found (first if ), it saves the current positions in ii and jj , in case backtracking becomes necessary, then skips the wildcard character.

This basically starts by assuming the wildcard matches the empty string (eg isMatch("ab", "a*b") ). When it continues iterating, the else if will match the rest and method ends up returning true .

Now, if a mismatch is found (the else block), it will try to backtrack. Of course, if it doesn't have a saved wildcard ( jj==-1 ), it can't backtrack, so it just returns false . That's why jj is initialized to -1 , so it can detect if a wildcard was saved. ii could be initialized to anything, but is initialized to -1 for consistency.

If a wildcard position was saved in ii and jj , it will restore those values, then forward i by one, ie assuming that if the next character is matched against the wildcard, the rest of the matching will succeed and return true .

That's the logic. Now, it could be optimized a tiny bit, because that backtracking is sub-optimal. It currently resets j back to the * , and i back to the next character. When it loops around, it will enter the if and save the save value again in jj and save the i value in ii , and then increment j . Since that is a given (unless end of s is reached), the backtracking could just do that too, saving an iteration loop, ie

} else {
    if(jj==-1) return false;

    i=++ii;
    j=jj+1;
}

The code looks buggy to me. (See below)

The ostensible purpose of ii and jj is to implement a form of backtracking.

For example, when you try to match "abcde" against the pattern "a*e", the algorithm will first match the "a" in the pattern against the "a" in the the input string. Then it will eagerly match the "*" against the rest of the string ... and find that it has made a mistake. At that point, it needs to backtrack and try an alternative

The ii and jj are to record the point to backtrack to, and the uses those variables are either recording a new backtrack point, or backtracking.

Or at least, that was probably the author's intent at some point.

The while(j<p.length() && p.charAt(j)=='*') j++; seems to be dealing with an edge-case


However, I don't think this code is correct.

  1. It certainly won't cope with backtracking in the case where there are multiple "*" wildcards in the pattern. That requires a recursive solution.

  2. The part:

     if(j<p.length() && p.charAt(j)=='*') { ii=i; jj=j; j++;

    doesn't make much sense. I'd have thought it should increment i not j . It might "mesh" with the behavior of the else part, but even if it does this is a convoluted way of coding this.


Advice:

  1. Don't use this code as an example. Even if it works (in a limited sense) it is not a good way to do this task, or an example of clarity or good style.
  2. I would handle this by translating the wildcard pattern into a regex and then using Pattern / Matcher to do the matching.

    For example: Wildcard matching in Java

I know you are asking about BST, but to be honest there is also a way of doing that with regex (not for competitive programming, but stable and fast enough be used in a production environment):

import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class WildCardMatcher{

    public static void main(String []args){
        // Test
        String urlPattern = "http://*.my-webdomain.???",
               urlToMatch = "http://webmail.my-webdomain.com";
        WildCardMatcher wildCardMatcher = new WildCardMatcher(urlPattern);
        System.out.printf("\"%s\".matches(\"%s\") -> %s%n", urlToMatch, wildCardMatcher, wildCardMatcher.matches(urlToMatch));
    }
     
    private final Pattern p;
    public WildCardMatcher(final String urlPattern){
       Pattern charsToEscape = Pattern.compile("([^*?]+)([*?]*)");
        
       // here we need to escape all the strings that are not "?" or "*", and replace any "?" and "*" with ".?" and ".*"
       Matcher m = charsToEscape.matcher(urlPattern);
       StringBuffer sb = new StringBuffer();
       String replacement, g1, g2;
       while(m.find()){
           g1 = m.group(1);
           g2 = m.group(2);
           // We first have to escape pattern (original string can contain charachters that are invalid for regex), then escaping the '\' charachters that have a special meaning for replacement strings
           replacement = (g1 == null ? "" : Matcher.quoteReplacement(Pattern.quote(g1))) +
                         (g2 == null ? "" : g2.replaceAll("([*?])", ".$1")); // simply replacing "*" and "?"" with ".*" and ".?"
           m.appendReplacement(sb, replacement);
       }
       m.appendTail(sb);
       p = Pattern.compile(sb.toString());
    }
     
    @Override
    public String toString(){
        return p.toString();
    }
     
    public boolean matches(final String urlToMatch){
        return p.matcher(urlToMatch).matches();
    }
}

There is still a list of optimizations that you can implement (lowecase / uppercase distinction, setting a max-length to the string being checked to prevent attackers to make you check against a 4-GigaByte-String, ...).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM