
Wildcard matching in Java

I'm writing a simple debugging program that takes simple strings as input. A star in the string is a wildcard that matches anything:

*.wav  // matches <anything>.wav
(*, a) // matches (<anything>, a)

I thought I would simply take that pattern, escape any regular expression special characters in it, then replace each \* back with .* and use a regular expression matcher.

But I can't find any Java function that escapes a regular expression. The closest match I could find is Pattern.quote, which however just puts \Q and \E at the beginning and end of the string.

Is there anything in Java that allows you to simply do that wildcard matching without you having to implement the algorithm from scratch?

Just escape everything - no harm will come of it.

    String input = "*.wav";
    String regex = ("\\Q" + input + "\\E").replace("*", "\\E.*\\Q");
    System.out.println(regex); // \Q\E.*\Q.wav\E
    System.out.println("abcd.wav".matches(regex)); // true
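One caveat with wrapping the input by hand: if the input itself contains \E, the quoting ends early. A minimal sketch that sidesteps this by splitting on * and letting Pattern.quote handle each literal piece is shown here. The class name StarWildcard and the helper starToRegex are made up for illustration.

```java
import java.util.regex.Pattern;

class StarWildcard {
    // Hypothetical helper: quotes each literal run and joins the runs with ".*"
    static String starToRegex(String input) {
        StringBuilder regex = new StringBuilder();
        // A limit of -1 keeps trailing empty strings, so a trailing '*' is not lost
        String[] parts = input.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) regex.append(".*");          // one ".*" per star in the input
            if (!parts[i].isEmpty()) regex.append(Pattern.quote(parts[i]));
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println("abcd.wav".matches(starToRegex("*.wav")));      // true
        System.out.println("(x, a)".matches(starToRegex("(*, a)")));       // true
        System.out.println("a\\Eb.wav".matches(starToRegex("a\\E*.wav"))); // true
    }
}
```

Pattern.quote already copes with an embedded \E, which is why it is applied to each piece rather than to the whole input.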

Or you can use character classes:

    String input = "*.wav";
    String regex = input.replaceAll(".", "[$0]").replace("[*]", ".*");
    System.out.println(regex); // .*[.][w][a][v]
    System.out.println("abcd.wav".matches(regex)); // true

It's easier to "escape" the characters by putting each one in its own character class, as almost all characters lose their special meaning inside a character class. The exceptions are \, ^, [ and ], which would still need a backslash. Unless you're expecting unusual file names, this will work.
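For inputs that may contain the few characters that stay special inside a character class, a character-by-character variant of the same idea could look like the following sketch. The class name CharClassWildcard and the method toRegex are hypothetical, and the set of escaped characters is an assumption that covers single-character classes only.

```java
class CharClassWildcard {
    // Hypothetical helper: wraps every character in its own class, '*' becomes ".*"
    static String toRegex(String input) {
        StringBuilder regex = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '\\' || c == '^' || c == '[' || c == ']') {
                // These four keep a special meaning inside [...], so add a backslash
                regex.append("[\\").append(c).append(']');
            } else {
                regex.append('[').append(c).append(']');
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.wav"));                              // .*[.][w][a][v]
        System.out.println("a^b[1].wav".matches(toRegex("a^b[*].wav")));   // true
    }
}
```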

Using A Simple Regex

One of this method's benefits is that we can easily add tokens besides * (see Adding Tokens at the bottom).

Search: [^*]+|(\*)

  • The left side of the | matches any run of characters that are not a star
  • The right side captures a star into Group 1
  • If Group 1 is not set: replace with \Q + Match + \E
  • If Group 1 is set: replace with .*

Here is some working code.

Input: audio*2012*.wav

Output: \Qaudio\E.*\Q2012\E.*\Q.wav\E

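import java.util.regex.Matcher;
import java.util.regex.Pattern;

String subject = "audio*2012*.wav";
Pattern regex = Pattern.compile("[^*]+|(\\*)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
    if (m.group(1) != null) {
        m.appendReplacement(b, ".*"); // a star becomes "match anything"
    } else {
        // quoteReplacement keeps '$' and '\' in the literal text from being
        // interpreted as group references or escapes by appendReplacement
        m.appendReplacement(b, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
    }
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced); // \Qaudio\E.*\Q2012\E.*\Q.wav\E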

Adding Tokens

Suppose we also want to convert the wildcard ?, which stands for a single character, to a dot. We just add a capture group to the regex and exclude ? from the match-all on the left:

Search: [^*?]+|(\*)|(\?)

In the replace loop we then add something like:

else if(m.group(2) != null) m.appendReplacement(b, "."); 

There is a small utility method in the Apache Commons IO library, org.apache.commons.io.FilenameUtils#wildcardMatch(), which you can use without dealing with the intricacies of regular expressions.

The API documentation can be found at: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FilenameUtils.html#wildcardMatch(java.lang.String,%20java.lang.String)

You can also use the quotation escapes \Q and \E (written "\\Q" and "\\E" in a Java string literal). Everything between them is treated as literal text and not as part of the regex to be evaluated. Thus this code should work:

    String input = "*.wav";
    String regex = "\\Q" + input.replace("*", "\\E.*?\\Q") + "\\E";

    // regex = "\\Q\\E.*?\\Q.wav\\E"

Note that, depending on how you want your wildcard to behave, it might be better to match * only against word characters using \w.
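As a rough sketch of that variant, the star can be replaced with \w* instead of .*? so that it only spans letters, digits and underscores. The class name WordWildcard and the method toRegex are made up for illustration, and the sketch assumes the input does not itself contain \E.

```java
class WordWildcard {
    // Hypothetical helper: '*' matches word characters only (assumes no "\E" in input)
    static String toRegex(String input) {
        return "\\Q" + input.replace("*", "\\E\\w*\\Q") + "\\E";
    }

    public static void main(String[] args) {
        String regex = toRegex("*.wav");
        System.out.println("abcd.wav".matches(regex));  // true
        System.out.println("ab cd.wav".matches(regex)); // false, a space is not a word character
    }
}
```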

Lucene has classes that provide this capability, with additional support for backslash as an escape character: ? matches a single character, * matches zero or more characters, and \ escapes the following character. Unicode code points are supported. It is supposed to be fast, but I haven't tested that.

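import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;

CharacterRunAutomaton characterRunAutomaton;
boolean matches;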
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Walmart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // false
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal*mart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // true
matches = characterRunAutomaton.run("Waldomart"); // true
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal\\*mart")));
matches = characterRunAutomaton.run("Walmart"); // false
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false

Regex While Accommodating A DOS/Windows Path

Implementing the quotation escapes \Q and \E is probably the best approach. However, since a backslash is typically used as the DOS/Windows file separator, a \E sequence within the path (as in C:\Temp\Extra) could affect the pairing of \Q and \E. While also accounting for the * and ? wildcard tokens, the backslash can be handled in this manner:

Search: [^*?\\]+|(\*)|(\?)|(\\)

Two new lines would be added in the replace function of the "Using A Simple Regex" example to accommodate the new search pattern. The code would still be "Linux-friendly". As a method, it could be written like this:

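import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String wildcardToRegex(String wildcardStr) {
    Pattern regex = Pattern.compile("[^*?\\\\]+|(\\*)|(\\?)|(\\\\)");
    Matcher m = regex.matcher(wildcardStr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        if (m.group(1) != null) m.appendReplacement(sb, ".*");             // * -> any run of characters
        else if (m.group(2) != null) m.appendReplacement(sb, ".");        // ? -> any single character
        else if (m.group(3) != null) m.appendReplacement(sb, "\\\\\\\\"); // \ -> escaped literal backslash
        // quoteReplacement stops a '$' in the literal text from being read as a group reference
        else m.appendReplacement(sb, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
    }
    m.appendTail(sb);
    return sb.toString();
}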

Code to demonstrate the implementation of this method could be written like this:

String s = "C:\\Temp\\Extra\\audio??2012*.wav";
System.out.println("Input: "+s);
System.out.println("Output: "+wildcardToRegex(s));

This would be the generated results:

Input: C:\Temp\Extra\audio??2012*.wav
Output: \QC:\E\\\QTemp\E\\\QExtra\E\\\Qaudio\E..\Q2012\E.*\Q.wav\E
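
To see why the separator needs its own token, here is a small sketch that compares plain \Q...\E wrapping with the token-based conversion on a path containing \E. The class and method names are made up for illustration. tokenBased uses plain StringBuilder appends rather than appendReplacement, which gives the same result here because the alternatives in the search pattern cover every character of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WindowsPathWildcard {
    // Naive conversion: wraps the whole input in \Q...\E and only handles '*'
    static String naive(String wildcard) {
        return "\\Q" + wildcard.replace("*", "\\E.*\\Q") + "\\E";
    }

    // Token-based conversion following the search pattern above
    static String tokenBased(String wildcard) {
        Matcher m = Pattern.compile("[^*?\\\\]+|(\\*)|(\\?)|(\\\\)").matcher(wildcard);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (m.group(1) != null) sb.append(".*");          // * -> any run of characters
            else if (m.group(2) != null) sb.append(".");      // ? -> any single character
            else if (m.group(3) != null) sb.append("\\\\");   // \ -> escaped literal backslash
            else sb.append("\\Q").append(m.group(0)).append("\\E");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String wildcard = "C:\\Temp\\Extra\\*.wav";
        String path = "C:\\Temp\\Extra\\song.wav";
        // The "\E" at the start of "\Extra" closes the quote too early
        System.out.println(path.matches(naive(wildcard)));      // false
        System.out.println(path.matches(tokenBased(wildcard))); // true
    }
}
```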

A recursive matcher that avoids regular expressions altogether:

public class WildcardMatcher {

  // Checks whether str matches pattern. The pattern may contain the wildcard
  // characters '*' (any run of characters) and '?' (any single character).
  static boolean matchPattern(String pattern, String str) {

    // If the pattern is used up, the match succeeds only if str is used up too
    if (pattern.isEmpty()) return str.isEmpty();

    // If there is '*', then there are two possibilities
    // a: '*' matches nothing, so we move past it in the pattern
    // b: '*' absorbs the current character of str
    if (pattern.charAt(0) == '*')
      return matchPattern(pattern.substring(1), str)
          || (!str.isEmpty() && matchPattern(pattern, str.substring(1)));

    // If the pattern has '?', or the current characters of both strings match,
    // consume one character from each (str must not be empty here)
    if (!str.isEmpty() && (pattern.charAt(0) == '?' || pattern.charAt(0) == str.charAt(0)))
      return matchPattern(pattern.substring(1), str.substring(1));

    return false;
  }

  // Prints "Yes" or "No" for one pattern/string pair
  static void test(String pattern, String str) {
    System.out.println(matchPattern(pattern, str) ? "Yes" : "No");
  }

  public static void main(String[] args) {
    test("w*ks", "weeks"); // Yes
    test("we?k*", "weekend"); // Yes
    test("g*k", "gee"); // No because 'k' is not in second
    test("*pqrs", "pqrst"); // No because 't' is not in first
    test("abc*bcd", "abcdhghgbcd"); // Yes
    test("abc*c?d", "abcd"); // No because second must have 2 instances of 'c'
    test("*c*d", "abcd"); // Yes
    test("*?c*d", "abcd"); // Yes
  }
}
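
The recursive version can take exponential time on patterns with many stars. As an alternative, here is a sketch of the common iterative technique that remembers the position of the last star and backtracks to it on a mismatch. This is a different algorithm from the recursive one, and the class and method names are hypothetical.

```java
class IterativeWildcard {
    // Returns true if str matches pattern, where '*' is any run and '?' is any single char
    static boolean matches(String pattern, String str) {
        int p = 0, s = 0;   // current positions in pattern and str
        int starP = -1;     // position of the most recent '*' in pattern, or -1
        int starS = 0;      // position in str where that '*' started matching
        while (s < str.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++;        // remember the star, first try matching nothing
                starS = s;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == str.charAt(s))) {
                p++;                // single character matched
                s++;
            } else if (starP >= 0) {
                p = starP + 1;      // mismatch: let the last star absorb one more char
                s = ++starS;
            } else {
                return false;       // mismatch and no star to fall back on
            }
        }
        // Only trailing stars may remain in the pattern
        while (p < pattern.length() && pattern.charAt(p) == '*') p++;
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("abc*bcd", "abcdhghgbcd")); // true
        System.out.println(matches("abc*c?d", "abcd"));        // false
    }
}
```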
