
Generating a regular expression from a string

I wish to generate a regular expression from a string containing numbers, and then use this as a Pattern to search for similar strings. Example:

String s = "Page 3 of 23";

If I substitute every digit with \d:

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            sb.append("\\d"); // backslash d
        } else {
            sb.append(c);
        }
    }

    Pattern numberPattern = Pattern.compile(sb.toString());

    // equivalent to: Pattern numberPattern = Pattern.compile("Page \\d of \\d\\d");

I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively, metacharacters such as ( ) { } - will not be escaped. Is there a library to do this, or an exhaustive list of the characters that I must and must not escape in a regular expression? (I could try to extract them from the Javadocs, but I am worried about missing something.)

Alternatively, is there a library which already does this? (I don't at this stage want to use a full Natural Language Processing solution.)

NOTE: @dasblinkenlight's edited answer now works for me!

Java's regexp library provides this functionality:

String s = Pattern.quote(orig);

The "quoted" string will have all of its metacharacters escaped. First quote your string, then go through it and replace the digits with \d to make a regular expression. Because Pattern.quote works by wrapping the text in \Q and \E, each piece of regex that you insert has to be enclosed in the inverse pair, \E before it and \Q after it, so that it sits outside the quoted region.
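As a quick illustration of what quoting does, here is a minimal sketch. The class name QuoteDemo, the helper matchesLiterally and the sample string are made up for this example; only Pattern.quote and Pattern.matches come from the JDK.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: true only if 'candidate' is exactly the text 'literal',
    // with every regex metacharacter in 'literal' treated as an ordinary character.
    static boolean matchesLiterally(String literal, String candidate) {
        return Pattern.matches(Pattern.quote(literal), candidate);
    }

    public static void main(String[] args) {
        String orig = "Total (a+b) {x}";
        // Pattern.quote wraps the text in \Q ... \E
        System.out.println(Pattern.quote(orig));                    // \QTotal (a+b) {x}\E
        System.out.println(matchesLiterally(orig, orig));           // true
        System.out.println(matchesLiterally(orig, "Total ab {x}")); // false
    }
}
```

Everything between \Q and \E is taken literally, which is why a \d placed inside that region would match a backslash followed by d rather than a digit.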

One thing I would change in your implementation is the replacement algorithm: rather than replacing character-by-character, I would replace digits in groups. This would let an expression produced from Page 3 of 23 match strings like Page 13 of 23 and Page 6 of 8 .

String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");

This would produce "\QPage \E\d+\Q of \E\d+\Q\E" no matter what page numbers and counts were there originally. The output needs only one backslash in \d, not two, because the result is fed directly to the regex engine, bypassing the Java compiler.
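Putting the pieces together, a small self-contained sketch might look like this. The class name NumberTemplate, the method fromExample and the second sample string are assumptions for the example, not part of any library.

```java
import java.util.regex.Pattern;

public class NumberTemplate {
    // Hypothetical helper: quotes 'orig', then turns every run of digits into \d+
    // placed outside the quoted region (\E before it, \Q after it).
    static Pattern fromExample(String orig) {
        String regex = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = fromExample("Page 3 of 23");
        System.out.println(p.pattern());                          // \QPage \E\d+\Q of \E\d+\Q\E
        System.out.println(p.matcher("Page 13 of 23").matches()); // true
        System.out.println(p.matcher("Page 6 of 8").matches());   // true
        System.out.println(p.matcher("Page x of 8").matches());   // false

        // Metacharacters in the example string stay literal.
        Pattern q = fromExample("Items (3) - {12}");
        System.out.println(q.matcher("Items (45) - {7}").matches()); // true
        System.out.println(q.matcher("Items 45 - 7").matches());     // false
    }
}
```

Note that matches() requires the whole input to match; use find() instead if the text may be embedded in a longer string.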
