Wildcard matching in Java

I'm writing a simple debugging program that takes as input simple strings that can contain stars to indicate a wildcard match-any:

*.wav  // matches <anything>.wav
(*, a) // matches (<anything>, a)

I thought I would simply take that pattern, escape any regular expression special characters in it, then replace any \\* back to .* and use a regular expression matcher.

But I can't find any Java function to escape a regular expression. The best match I could find is Pattern.quote, which however just puts \Q and \E at the beginning and end of the string.

Is there anything in Java that allows you to simply do that wildcard matching without having to implement the algorithm from scratch?

Just escape everything - no harm will come of it.

    String input = "*.wav";
    String regex = ("\\Q" + input + "\\E").replace("*", "\\E.*\\Q");
    System.out.println(regex); // \Q\E.*\Q.wav\E
    System.out.println("abcd.wav".matches(regex)); // true

Or you can use character classes:

    String input = "*.wav";
    String regex = input.replaceAll(".", "[$0]").replace("[*]", ".*");
    System.out.println(regex); // .*[.][w][a][v]
    System.out.println("abcd.wav".matches(regex)); // true

It's easier to "escape" the characters by putting them in a character class, as almost all characters lose any special meaning when in a character class. Unless you're expecting weird file names, this will work. The exceptions are characters such as ^ and \, which keep their special meaning inside a character class, so an input containing them produces an invalid regex.
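If the input may contain characters such as ^ or \, a more defensive variant is to quote each literal piece with Pattern.quote and join the pieces with .* instead. The following is only a sketch under that assumption: the class name WildcardQuote and the method toRegex are made up for illustration, and only the * wildcard is supported.

```java
import java.util.regex.Pattern;

final class WildcardQuote {
    // Hypothetical helper: converts a pattern in which '*' means "any text" into a regex.
    static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        // The -1 limit keeps trailing empty pieces, so a trailing '*' is not lost.
        String[] literals = wildcard.split("\\*", -1);
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // one ".*" for each star in the input
            }
            if (!literals[i].isEmpty()) {
                regex.append(Pattern.quote(literals[i])); // quote the text between stars
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.wav"));                      // .*\Q.wav\E
        System.out.println("abcd.wav".matches(toRegex("*.wav")));  // true
        System.out.println("a^b.wav".matches(toRegex("a^*.wav"))); // true
    }
}
```

Because Pattern.quote is applied to each piece separately, a literal \E in the input does not end the quoting early, which the plain string concatenation above cannot guarantee.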

Using A Simple Regex

One of this method's benefits is that we can easily add tokens besides * (see Adding Tokens at the bottom).

Search: [^*]+|(\*)

  • The left side of the | matches any chars that are not a star
  • The right side captures all stars to Group 1
  • If Group 1 is empty: replace with \Q + Match + \E
  • If Group 1 is set: replace with .*

Here is some working code.

Input: audio*2012*.wav

Output: \Qaudio\E.*\Q2012\E.*\Q.wav\E

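import java.util.regex.Matcher;
import java.util.regex.Pattern;

String subject = "audio*2012*.wav";
Pattern regex = Pattern.compile("[^*]+|(\\*)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
    // Group 1 is set only when the current match is a star
    if (m.group(1) != null) m.appendReplacement(b, ".*");
    // Otherwise wrap the literal run in \Q...\E; quoteReplacement keeps '$' and '\'
    // in the matched text from being read as replacement syntax
    else m.appendReplacement(b, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced); // \Qaudio\E.*\Q2012\E.*\Q.wav\E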

Adding Tokens

Suppose we also want to convert the wildcard ?, which stands for a single character, to a dot. We just add a capture group to the regex and exclude ? from the match-all on the left:

Search: [^*?]+|(\*)|(\?)

In the replace function we then add something like:

else if(m.group(2) != null) m.appendReplacement(b, "."); 

There is a small utility method in the Apache Commons IO library, org.apache.commons.io.FilenameUtils#wildcardMatch(), which you can use without the intricacies of regular expressions.

API documentation can be found at: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FilenameUtils.html#wildcardMatch(java.lang.String,%20java.lang.String)

You can also use the quotation escape sequences \Q and \E: everything between them is treated as literal text and not as part of the regex to be evaluated. Thus this code should work:

    String input = "*.wav";
    String regex = "\\Q" + input.replace("*", "\\E.*?\\Q") + "\\E";

    // regex = "\\Q\\E.*?\\Q.wav\\E"

Note that your * wildcard might also be best matched only against word characters using \w, depending on how you want your wildcard to behave.
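To make the difference concrete, here is a small sketch that builds the regex with either .* or \w* substituted for the star. The class name WordWildcard and the starRegex parameter are hypothetical and exist only for this illustration; the quoting is the same \Q...\E concatenation as in the snippet above.

```java
final class WordWildcard {
    // Hypothetical helper: 'starRegex' is the regex fragment substituted for each '*'.
    static String toRegex(String wildcard, String starRegex) {
        return "\\Q" + wildcard.replace("*", "\\E" + starRegex + "\\Q") + "\\E";
    }

    public static void main(String[] args) {
        String any = toRegex("*.wav", ".*");    // star matches any characters
        String word = toRegex("*.wav", "\\w*"); // star matches only word characters
        System.out.println("my song.wav".matches(any));  // true
        System.out.println("my song.wav".matches(word)); // false, a space is not a word character
        System.out.println("my_song.wav".matches(word)); // true
    }
}
```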

Lucene has classes that provide this capability, with additional support for backslash as an escape character. ? matches a single character, * matches 0 or more characters, \ escapes the following character. Supports Unicode code points. Supposed to be fast but I haven't tested.

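import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;

CharacterRunAutomaton characterRunAutomaton;
boolean matches;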
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Walmart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // false
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal*mart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // true
matches = characterRunAutomaton.run("Waldomart"); // true
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal\\*mart")));
matches = characterRunAutomaton.run("Walmart"); // false
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false

Regex While Accommodating A DOS/Windows Path

Implementing the Quotation escape characters \Q and \E is probably the best approach. However, since a backslash is typically used as a DOS/Windows file separator, a "\E" sequence within the path could affect the pairing of \Q and \E. While accounting for the * and ? wildcard tokens, the backslash can be handled in this manner:

Search: [^*?\\]+|(\*)|(\?)|(\\)

Two new lines would be added in the replace function of the "Using A Simple Regex" example to accommodate the new search pattern. The code would still be "Linux-friendly". As a method, it could be written like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String wildcardToRegex(String wildcardStr) {
    Pattern regex = Pattern.compile("[^*?\\\\]+|(\\*)|(\\?)|(\\\\)");
    Matcher m = regex.matcher(wildcardStr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        if (m.group(1) != null) m.appendReplacement(sb, ".*");            // '*' becomes .*
        else if (m.group(2) != null) m.appendReplacement(sb, ".");        // '?' becomes .
        else if (m.group(3) != null) m.appendReplacement(sb, "\\\\\\\\"); // '\' becomes \\
        // Literal run: wrap in \Q...\E; quoteReplacement keeps '$' in the text literal
        else m.appendReplacement(sb, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
    }
    m.appendTail(sb);
    return sb.toString();
}

Code to demonstrate the implementation of this method could be written like this:

String s = "C:\\Temp\\Extra\\audio??2012*.wav";
System.out.println("Input: "+s);
System.out.println("Output: "+wildcardToRegex(s));

This would be the generated results:

Input: C:\Temp\Extra\audio??2012*.wav
Output: \QC:\E\\\QTemp\E\\\QExtra\E\\\Qaudio\E..\Q2012\E.*\Q.wav\E

A recursive matcher that does not rely on regular expressions is another option:

  // Checks whether str matches pattern. The pattern may contain the wildcard
  // characters '*' (any run of characters) and '?' (exactly one character).
  static boolean matchPattern(String pattern, String str) {

    // If we reach the end of both strings, we are done
    if (pattern.length() == 0 && str.length() == 0) return true;

    // Make sure that the characters after '*' are present in str. This function assumes that
    // the pattern will not contain two consecutive '*'
    if (pattern.length() > 1 && pattern.charAt(0) == '*' && str.length() == 0) return false;

    // If there is a '*', then there are two possibilities
    // a: The '*' matches nothing, so we move on in the pattern
    // b: The '*' absorbs the current character of str
    // This is checked first so that a literal '*' in str is not mistaken for a plain match
    if (pattern.length() > 0 && pattern.charAt(0) == '*')
      return matchPattern(pattern.substring(1), str) || matchPattern(pattern, str.substring(1));

    // If the pattern has '?', or the current characters of both strings match.
    // Both strings must be non-empty, otherwise substring(1) would throw.
    if (pattern.length() != 0 && str.length() != 0
        && (pattern.charAt(0) == '?' || pattern.charAt(0) == str.charAt(0)))
      return matchPattern(pattern.substring(1), str.substring(1));

    return false;
  }

  // Prints "Yes" or "No" depending on whether str matches pattern
  static void test(String pattern, String str) {
    System.out.println(matchPattern(pattern, str) ? "Yes" : "No");
  }

  public static void main(String[] args) {
    test("w*ks", "weeks"); // Yes
    test("we?k*", "weekend"); // Yes
    test("g*k", "gee"); // No because 'k' is not in second
    test("*pqrs", "pqrst"); // No because 't' is not in first
    test("abc*bcd", "abcdhghgbcd"); // Yes
    test("abc*c?d", "abcd"); // No because second must have 2 instances of 'c'
    test("*c*d", "abcd"); // Yes
    test("*?c*d", "abcd"); // Yes
  }
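
The recursive version creates a new substring at every step and may revisit the same positions many times when the pattern has several stars. An iterative variant that remembers only the most recent star is a common alternative. The following is a sketch rather than a drop-in replacement; the class name IterativeWildcard is made up for illustration. Unlike the recursive version, it does not assume that the pattern is free of consecutive stars.

```java
final class IterativeWildcard {
    // Returns true if 'str' matches 'pattern', where '*' matches any run of
    // characters (including none) and '?' matches exactly one character.
    static boolean matches(String pattern, String str) {
        int p = 0, s = 0; // current positions in pattern and str
        int starP = -1;   // position of the most recent '*' in pattern, -1 if none yet
        int starS = 0;    // position in str where that '*' currently stops absorbing
        while (s < str.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++; // remember the star and first try letting it match nothing
                starS = s;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == str.charAt(s))) {
                p++;         // single-character match
                s++;
            } else if (starP >= 0) {
                p = starP + 1; // backtrack: let the last star absorb one more character
                s = ++starS;
            } else {
                return false;  // mismatch and no star to fall back on
            }
        }
        // Any stars left at the end of the pattern match the empty string
        while (p < pattern.length() && pattern.charAt(p) == '*') p++;
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("w*ks", "weeks"));         // true
        System.out.println(matches("g*k", "gee"));            // false
        System.out.println(matches("abc*bcd", "abcdhghgbcd")); // true
    }
}
```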
