
Generating a regular expression from a string

I wish to generate a regular expression from a string containing numbers, and then use it as a Pattern to search for similar strings. Example:

String s = "Page 3 of 23";

If I replace every digit with \d

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            sb.append("\\d"); // backslash d
        } else {
            sb.append(c);
        }
    }

    Pattern numberPattern = Pattern.compile(sb.toString());

    // Equivalent to: Pattern numberPattern = Pattern.compile("Page \\d of \\d\\d");

I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively, metacharacters such as ( ) { } - and so on will not be escaped. Is there a library that does this, or an exhaustive list of the characters that I must and must not escape in a regular expression? (I could try to extract them from the Javadocs, but I am worried about missing something.)
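To make the problem concrete, here is a minimal sketch of how the naive approach goes wrong when the template contains parentheses. The class and method names are made up for illustration; the loop is the one from the question.

```java
import java.util.regex.Pattern;

class NaiveDigitPattern {
    // Replaces each digit with \d and copies every other character unescaped.
    static String naive(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(Character.isDigit(c) ? "\\d" : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The generated regex is: Page \d (of \d\d)
        Pattern p = Pattern.compile(naive("Page 3 (of 23)"));
        // The parentheses are read as a capturing group, not as literal text.
        System.out.println(p.matcher("Page 7 (of 47)").matches()); // false
        System.out.println(p.matcher("Page 7 of 47").matches());   // true
    }
}
```

The pattern rejects the string it was meant to match and accepts one without the parentheses.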

Alternatively, is there a library that already does this? (At this stage I don't want to use a full Natural Language Processing solution.)

NOTE: @dasblinkenlight's edited answer now works for me!

Java's regex library provides this functionality:

String s = Pattern.quote(orig);

The "quoted" string has all of its metacharacters escaped. First quote your string, then go through it and replace the digits with \d to make a regular expression. Since the regex library uses \Q and \E for quoting, everything between them is treated as literal text, so you need to enclose your own portion of the regex in the inverse pair: \E before it and \Q after it.
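The following sketch shows why the \d has to sit outside the quoted region. The class name and the two constant patterns are made up for illustration; Pattern.quote and Pattern.matches are the standard java.util.regex calls.

```java
import java.util.regex.Pattern;

class QuoteRegionDemo {
    // \d inside \Q...\E is literal text: a backslash followed by the letter d.
    static final String ALL_QUOTED = "\\QPage \\d\\E";
    // Closing the quote with \E before \d lets the engine read it as "one digit".
    static final String DIGIT_OUTSIDE = "\\QPage \\E\\d";

    public static void main(String[] args) {
        System.out.println(Pattern.quote("Page (3)"));                // \QPage (3)\E
        System.out.println(Pattern.matches(ALL_QUOTED, "Page 7"));    // false
        System.out.println(Pattern.matches(ALL_QUOTED, "Page \\d"));  // true
        System.out.println(Pattern.matches(DIGIT_OUTSIDE, "Page 7")); // true
    }
}
```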

One thing I would change in your implementation is the replacement algorithm: rather than replacing character by character, I would replace digits in groups. This would let an expression produced from Page 3 of 23 match strings like Page 13 of 23 and Page 6 of 8.

String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");

This produces the regex \QPage \E\d+\Q of \E\d+\Q\E no matter what page numbers and counts were there originally. The output contains only one backslash in \d, not two, because the result is fed directly to the regex engine at run time, bypassing the Java compiler.
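Putting the pieces together, a minimal end-to-end sketch might look like this. The class name and the helper fromTemplate are made up for illustration; the body is the one-liner above followed by Pattern.compile.

```java
import java.util.regex.Pattern;

class DigitPatternBuilder {
    // Hypothetical helper: quotes the whole template, then swaps each run of
    // digits for \d+ placed outside the quoted region (\E ... \Q).
    static Pattern fromTemplate(String orig) {
        String regex = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = fromTemplate("Page 3 (of 23)");
        System.out.println(p.pattern()); // \QPage \E\d+\Q (of \E\d+\Q)\E
        System.out.println(p.matcher("Page 13 (of 230)").matches()); // true
        System.out.println(p.matcher("Page 7 of 47").matches());     // false
        System.out.println(p.matcher("Page x (of 23)").matches());   // false
    }
}
```

Because the parentheses stay inside a quoted region, they are matched literally, and each number may have any number of digits.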
