
Java: How to implement wildcard matching?

I'm researching how to find k values in the BST that are closest to the target, and came across the following implementation with the rules:

'?' Matches any single character.

'*' Matches any sequence of characters (including the empty sequence).

The matching should cover the entire input string (not partial).

The function prototype should be: bool isMatch(const char *s, const char *p)

Some examples:

isMatch("aa","a") → false

isMatch("aa","aa") → true

isMatch("aaa","aa") → false

isMatch("aa", "*") → true

isMatch("aa", "a*") → true

isMatch("ab", "?*") → true

isMatch("aab", "c*a*b") → false

Code:

import java.util.*;

public class WildcardMatching {
    boolean isMatch(String s, String p) {
        int i=0, j=0;
        int ii=-1, jj=-1;

        while(i<s.length()) {
            if(j<p.length() && p.charAt(j)=='*') {
                ii=i;
                jj=j;
                j++;
            } else if(j<p.length() && 
                      (s.charAt(i) == p.charAt(j) ||
                       p.charAt(j) == '?')) {
                i++;
                j++;
            } else {
                if(jj==-1) return false;

                j=jj;
                i=ii+1;
            }
        }

        while(j<p.length() && p.charAt(j)=='*') j++;

        return j==p.length();
    }

    public static void main(String args[]) {
        String s = "aab";
        String p = "a*";

        WildcardMatching wcm = new WildcardMatching();
        System.out.println(wcm.isMatch(s, p));
    }
}

And my question is, what's the reason for having two additional indexes, ii and jj, and why are they initialized to -1? What's the purpose of each? Wouldn't traversing it with i and j be enough?

And what's the purpose of ii=i; and jj=j; in the first if case, and i=ii+1; and j=jj; in the third case?

Lastly, in what case would you encounter while(j<p.length() && p.charAt(j)=='*') j++;?

Examples would be extremely helpful in understanding. Thank you in advance and will accept answer/up vote.

It looks like ii and jj are used to handle the wildcard "*", which matches any sequence. Their initialization to -1 acts as a flag: if we hit a mismatch while jj is still -1, no "*" has been seen yet, so there is nothing to fall back on. We can walk through your examples one at a time.

Notice that i indexes the parameter s (the original string) and j indexes the parameter p (the pattern).

isMatch("aa","a"): this returns false because the j<p.length() check fails while we are still inside the while loop: the length of p ("a") is only 1 whereas the length of s ("aa") is 2, so we jump to the else block. This is where the -1 initialization comes in: since we never saw any wildcards in p, jj is still -1, indicating that there's no way the strings can match, so we return false.

isMatch("aa","aa"): s and p are exactly the same, so the program repeatedly takes the else-if block with no problems and finally leaves the while loop once i equals 2 (the length of "aa"). The second while loop never runs, since j is not less than p.length(): the else-if increments i and j together, so both equal 2, and 2 is not less than the length of "aa". We return j == p.length(), which evaluates to 2 == 2, and get true.

isMatch("aaa","aa"): this one fails for the same reason as the first. Namely, the strings are not the same length and we never hit a wildcard character.

isMatch("aa","*"): this is where it gets interesting. First we enter the if block, since we've seen a "*" in p. We set ii and jj to 0 and increment j only. On the second iteration, j<p.length() fails, so we jump to the else block. jj is not -1 anymore (it's 0), so we reset j to 0 and set i to 0+1. This basically allows us to keep evaluating the wildcard, since j just gets reset to jj, which holds the position of the wildcard, and ii tells us where to start from in our original string.

This test case also explains the second while loop. In some cases our pattern may be much shorter than the original string, so we need to make sure the rest of it is made up of wildcards. For example, isMatch("aaaaaa","a**") should return true, but the final return statement checks j == p.length(), asking whether we consumed the entire pattern. The main loop ends as soon as s is exhausted, which can leave j sitting on a "*", so we finally run through the rest of the pattern and make sure it only contains wildcards.
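To watch these steps happen, here is a minimal sketch that copies the loop from the question and prints the four indexes at the top of every iteration. The class name WildcardTrace and the print statement are additions for illustration only; the matching logic is intended to be unchanged.

```java
class WildcardTrace {
    // Same loop as in the question, with one added print per iteration.
    static boolean isMatch(String s, String p) {
        int i = 0, j = 0;
        int ii = -1, jj = -1;

        while (i < s.length()) {
            System.out.printf("i=%d j=%d ii=%d jj=%d%n", i, j, ii, jj);
            if (j < p.length() && p.charAt(j) == '*') {
                ii = i;   // position in s where the wildcard starts matching
                jj = j;   // position of the wildcard in p
                j++;
            } else if (j < p.length()
                    && (s.charAt(i) == p.charAt(j) || p.charAt(j) == '?')) {
                i++;
                j++;
            } else {
                if (jj == -1) return false;   // no wildcard seen, nothing to retry
                j = jj;
                i = ii + 1;
            }
        }

        while (j < p.length() && p.charAt(j) == '*') j++;
        return j == p.length();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("aa", "*"));
    }
}
```

For isMatch("aa", "*") the trace shows the loop alternating between the if block and the else block, with ii moving from 0 to 1 while jj stays at 0, until i reaches 2 and the loop ends with j back on the "*".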

From here you can figure out the logic behind the other test cases. I hope this helped!

Let's look at this a bit out of order.

First, this is a parallel iteration of the string (s) and the wildcard pattern (p), using variable i to index s and variable j to index p.

The while loop stops iterating when the end of s is reached. When that happens, hopefully the end of p has been reached too, in which case it'll return true (j==p.length()).

If however p ends with a *, that is also valid (e.g. isMatch("ab", "ab*")), and that's what the while(j<p.length() && p.charAt(j)=='*') j++; loop ensures, i.e. any * in the pattern at this point is skipped, and if that reaches the end of p, then it returns true. If the end of p is not reached, it returns false.
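That final step can be looked at in isolation. The sketch below pulls the trailing loop into a hypothetical helper, onlyStarsFrom (not a name from the question), and calls it with j = 2, which is where j would be after "ab" has been matched against the first two pattern characters.

```java
class TrailingStars {
    // Hypothetical helper: true if p contains nothing but '*' from index j onward.
    static boolean onlyStarsFrom(String p, int j) {
        while (j < p.length() && p.charAt(j) == '*') j++;
        return j == p.length();
    }

    public static void main(String[] args) {
        System.out.println(onlyStarsFrom("ab", 2));    // nothing left in the pattern
        System.out.println(onlyStarsFrom("ab*", 2));   // one trailing wildcard
        System.out.println(onlyStarsFrom("ab**", 2));  // several trailing wildcards
        System.out.println(onlyStarsFrom("ab*c", 2));  // a literal 'c' is still unmatched
    }
}
```

The first three calls return true and the last returns false, because the leftover "c" has no character in s to match.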

That was the answer to your last question. Now let's look at the loop. The else if advances both i and j as long as there is a match, e.g. 'a' against 'a' or 'a' against '?'.

When a * wildcard is found (the first if), it saves the current positions in ii and jj, in case backtracking becomes necessary, then skips the wildcard character.

This basically starts by assuming the wildcard matches the empty string (e.g. isMatch("ab", "a*b")). When it continues iterating, the else if will match the rest and the method ends up returning true.

Now, if a mismatch is found (the else block), it will try to backtrack. Of course, if it doesn't have a saved wildcard (jj==-1), it can't backtrack, so it just returns false. That's why jj is initialized to -1, so it can detect if a wildcard was saved. ii could be initialized to anything, but is initialized to -1 for consistency.

If a wildcard position was saved in ii and jj, it restores those values, then advances i by one, i.e. it lets the wildcard absorb one more character of s and retries the rest of the pattern from there.

That's the logic. Now, it could be optimized a tiny bit, because that backtracking is sub-optimal. It currently resets j back to the *, and i back to the next character. When it loops around, it will enter the if, save the same value again in jj, save the i value in ii, and then increment j. Since that is a given (unless the end of s is reached), the backtracking could just do that too, saving a loop iteration, i.e.

} else {
    if(jj==-1) return false;

    i=++ii;
    j=jj+1;
}
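For reference, here is a sketch of the whole method with that else block swapped in, run against the examples from the question plus the "ab" / "a*b" case mentioned above. The class name WildcardMatchingOptimized is made up for this example; everything outside the else block is meant to be the same as the original.

```java
class WildcardMatchingOptimized {
    static boolean isMatch(String s, String p) {
        int i = 0, j = 0;
        int ii = -1, jj = -1;

        while (i < s.length()) {
            if (j < p.length() && p.charAt(j) == '*') {
                ii = i;
                jj = j;
                j++;
            } else if (j < p.length()
                    && (s.charAt(i) == p.charAt(j) || p.charAt(j) == '?')) {
                i++;
                j++;
            } else {
                if (jj == -1) return false;

                // Let the saved wildcard absorb one more character and resume
                // right after it, instead of re-entering the first if.
                i = ++ii;
                j = jj + 1;
            }
        }

        while (j < p.length() && p.charAt(j) == '*') j++;
        return j == p.length();
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"aa", "a"}, {"aa", "aa"}, {"aaa", "aa"}, {"aa", "*"},
            {"aa", "a*"}, {"ab", "?*"}, {"aab", "c*a*b"}, {"ab", "a*b"}
        };
        for (String[] c : cases) {
            System.out.printf("isMatch(\"%s\", \"%s\") -> %s%n",
                              c[0], c[1], isMatch(c[0], c[1]));
        }
    }
}
```

If s runs out right after a backtrack, j lands one past the wildcard instead of on it, but the trailing loop skips any remaining wildcards either way, so the returned value should be the same.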

The code looks buggy to me. (See below)

The ostensible purpose of ii and jj is to implement a form of backtracking.

For example, when you try to match "abcde" against the pattern "a*e", the algorithm will first match the "a" in the pattern against the "a" in the input string. Then it will eagerly match the "*" against the rest of the string ... and find that it has made a mistake. At that point, it needs to backtrack and try an alternative.

ii and jj record the point to backtrack to, and each use of those variables is either recording a new backtrack point or backtracking.

Or at least, that was probably the author's intent at some point.

The while(j<p.length() && p.charAt(j)=='*') j++; seems to be dealing with an edge case.


However, I don't think this code is correct.

  1. It certainly won't cope with backtracking in the case where there are multiple "*" wildcards in the pattern. That requires a recursive solution.

  2. The part:

     if(j<p.length() && p.charAt(j)=='*') { ii=i; jj=j; j++;

    doesn't make much sense. I'd have thought it should increment i not j. It might "mesh" with the behavior of the else part, but even if it does, this is a convoluted way of coding this.
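One way to probe the multiple-wildcard concern is to compare the loop against a plain recursive matcher on every short input. The sketch below is only a sanity check under stated assumptions: recursiveMatch, allStrings and countDisagreements are hypothetical helpers written for this example, the alphabet is limited to "a" and "b", and the lengths are kept small so the exhaustive search stays fast.

```java
import java.util.ArrayList;
import java.util.List;

class WildcardCrossCheck {
    // The loop from the question, unchanged apart from formatting.
    static boolean loopMatch(String s, String p) {
        int i = 0, j = 0, ii = -1, jj = -1;
        while (i < s.length()) {
            if (j < p.length() && p.charAt(j) == '*') {
                ii = i; jj = j; j++;
            } else if (j < p.length()
                    && (s.charAt(i) == p.charAt(j) || p.charAt(j) == '?')) {
                i++; j++;
            } else {
                if (jj == -1) return false;
                j = jj;
                i = ii + 1;
            }
        }
        while (j < p.length() && p.charAt(j) == '*') j++;
        return j == p.length();
    }

    // Reference matcher: for each '*', recursively tries every possible length.
    static boolean recursiveMatch(String s, int i, String p, int j) {
        if (j == p.length()) return i == s.length();
        char c = p.charAt(j);
        if (c == '*') {
            for (int k = i; k <= s.length(); k++) {
                if (recursiveMatch(s, k, p, j + 1)) return true;
            }
            return false;
        }
        return i < s.length() && (c == '?' || c == s.charAt(i))
                && recursiveMatch(s, i + 1, p, j + 1);
    }

    // Every string over the given alphabet with length 0..maxLen.
    static List<String> allStrings(String alphabet, int maxLen) {
        List<String> out = new ArrayList<>();
        out.add("");
        int start = 0;
        for (int len = 1; len <= maxLen; len++) {
            int end = out.size();
            for (int k = start; k < end; k++) {
                for (char c : alphabet.toCharArray()) out.add(out.get(k) + c);
            }
            start = end;
        }
        return out;
    }

    // Number of (s, p) pairs on which the two matchers give different answers.
    static int countDisagreements(int maxLen) {
        int bad = 0;
        for (String s : allStrings("ab", maxLen)) {
            for (String p : allStrings("ab*?", maxLen)) {
                if (loopMatch(s, p) != recursiveMatch(s, 0, p, 0)) bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements(5));
    }
}
```

The loop only ever backtracks to the most recent "*", and the count comes out as 0 for these inputs, patterns with several wildcards included. That is a finite check rather than a proof, but it suggests the single saved backtrack point is enough for this pattern language.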


Advice:

  1. Don't use this code as an example. Even if it works (in a limited sense) it is not a good way to do this task, or an example of clarity or good style.
  2. I would handle this by translating the wildcard pattern into a regex and then using Pattern / Matcher to do the matching.

    For example: Wildcard matching in Java

I know you are asking about BST, but to be honest there is also a way of doing that with regex (not for competitive programming, but stable and fast enough to be used in a production environment):

import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class WildCardMatcher{

    public static void main(String []args){
        // Test
        String urlPattern = "http://*.my-webdomain.???",
               urlToMatch = "http://webmail.my-webdomain.com";
        WildCardMatcher wildCardMatcher = new WildCardMatcher(urlPattern);
        System.out.printf("\"%s\".matches(\"%s\") -> %s%n", urlToMatch, wildCardMatcher, wildCardMatcher.matches(urlToMatch));
    }
     
    private final Pattern p;
    public WildCardMatcher(final String urlPattern){
       // Each match is either a run of literal characters (group 1) or a run of wildcards (group 2).
       // The alternation also covers patterns that start with "*" or "?".
       Pattern tokens = Pattern.compile("([^*?]+)|([*?]+)");

       Matcher m = tokens.matcher(urlPattern);
       StringBuffer sb = new StringBuffer();
       String replacement, g1, g2;
       while(m.find()){
           g1 = m.group(1);
           g2 = m.group(2);
           // Literal text is quoted first (it can contain characters that are special in a regex),
           // then quoteReplacement escapes the '\' characters, which are special in replacement strings.
           // Wildcards become ".*" (any sequence) and "." (exactly one character).
           replacement = (g1 != null)
                       ? Matcher.quoteReplacement(Pattern.quote(g1))
                       : g2.replace("*", ".*").replace("?", ".");
           m.appendReplacement(sb, replacement);
       }
       m.appendTail(sb);
       p = Pattern.compile(sb.toString());
    }
     
    @Override
    public String toString(){
        return p.toString();
    }
     
    public boolean matches(final String urlToMatch){
        return p.matcher(urlToMatch).matches();
    }
}

There is still a list of improvements that you can implement (lowercase / uppercase distinction, setting a max length on the string being checked to prevent attackers from making you check against a 4-gigabyte string, ...).
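As a rough sketch of two of those ideas, the class below compiles a wildcard pattern with Pattern.CASE_INSENSITIVE and refuses inputs above a fixed length. The class name BoundedWildcard, the 2048-character limit and the character-by-character translation are assumptions made for this example, not part of the code above.

```java
import java.util.regex.Pattern;

class BoundedWildcard {
    // Assumed limit; pick a value that fits the inputs you expect.
    static final int MAX_INPUT_LENGTH = 2048;

    // Translates '*' to ".*" and '?' to "."; all other text is quoted literally.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        // CASE_INSENSITIVE ignores ASCII case; DOTALL lets '.' match line breaks too.
        return Pattern.compile(regex.toString(),
                               Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // Rejects over-long inputs before the regex engine ever sees them.
    static boolean matches(Pattern p, String input) {
        if (input.length() > MAX_INPUT_LENGTH) return false;
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        Pattern p = compile("http://*.my-webdomain.???");
        System.out.println(matches(p, "HTTP://WEBMAIL.MY-WEBDOMAIN.COM"));
        System.out.println(matches(p, "http://webmail.other-domain.com"));
    }
}
```

Checking the length before matching keeps the cost of a hostile input bounded; whether rejecting, truncating or throwing is the right reaction depends on the application.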
