Java regex backreference for two digits

I am working with a regex and I want to use it with the replaceAll method of the String class in Java.

My regex works fine and groupCount() returns 11. But when I try to replace my text using a backreference to the eleventh group, I get the first group with a "1" appended to it instead of group eleven.

String regex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(regex, "$1<a href=\"tel:$2\">$2</a>$11");

I am expecting to get the following result:

<span style="font-size:11.0pt"><a href="tel:675-441-3144;;;78888464#">675-441-3144;;;78888464#</a><o:p></o:p></span>

But the $11 backreference does not return the 11th group; it returns the first group with a 1 appended to it, so I get the following result instead:

<span style="font-size:11.0pt"><a href="tel:675-441-3144">675-441-3144</a>>1o:p></o:p></span>

Can someone please tell me how to access the eleventh group of my pattern?

Thanks.

Short Answer

The way you access the eleventh group of a match in the replacement is with $11.

Explanation:

As the corresponding Javadoc * states:

The replacement string may contain references to subsequences captured during the previous match: Each occurrence of ${name} or $g will be replaced by the result of evaluating the corresponding group(name) or group(g) respectively. For $g, the first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference.

So generally speaking, as long as you have at least eleven groups, "$11" will evaluate to group(11). However, if you do not have at least eleven groups, "$11" will evaluate to group(1) + "1".

* This quote is from Matcher#appendReplacement(StringBuffer,String), which is where the chain of relevant citations from String#replaceAll(String,String) leads.
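The rule quoted above can be tried on a toy input. The following is a minimal sketch, not the regex from the question: the class name TwoDigitBackrefDemo and both patterns are made up for illustration. It applies the same "$11" replacement once to a pattern with eleven groups and once to a pattern with only two.

```java
// Hypothetical demo class; the patterns are toy examples, not the regex from the question.
class TwoDigitBackrefDemo {

    // Eleven capture groups: "$11" is a legal group reference, so it resolves to group(11).
    static String elevenGroups() {
        return "abcdefghijk".replaceAll("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)", "[$11]");
    }

    // Only two capture groups: group 11 does not exist, so "$11" is read as "$1" plus a literal "1".
    static String twoGroups() {
        return "ab".replaceAll("(a)(b)", "[$11]");
    }

    public static void main(String[] args) {
        System.out.println(elevenGroups()); // prints "[k]"
        System.out.println(twoGroups());    // prints "[a1]"
    }
}
```

The brackets in the replacement are only there to make the substituted text easy to spot in the output.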


Actual Answer

Your regex does not do what you think it does.

Part 1

The Problem

Let's divide your regex into its three top-level groups. These are groups 1, 2, and 11, respectively.

  • Group 1:
    (>[^<]*?)
  • Group 2:
    ((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)
  • Group 11:
    ([^<]*<)

Group 2 is the main body of your regex, and it consists of a top-level alternation over two options. These two options consist of groups 3-8 and 9-10, respectively.

  • First option:
    (\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})
  • Second option:
    (\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?

Now, given the text string, here is what is going on:

  1. Group 1 executes. It matches the first ">".
  2. Group 2 executes. It evaluates the options of its alternation in order.
    1. The first option of group 2's alternation executes. It matches "675-441-3144".
    2. Group 2's alternation successfully short-circuits upon the match of one of its options.
      • Group 2 as a whole is now equal to the option that matched, which is "675-441-3144".
      • The cursor is now positioned immediately after "675-441-3144", which is immediately before ";;;78888464#".
  3. Group 11 executes. It matches everything up through the next "<", which is all of ";;;78888464#<".

Thus, some of the content that you want to be in group 2 is actually in group 11 instead.
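The walkthrough above can be reproduced by dumping every capture group of the first match. This is a small sketch under the assumption that the regex and text are exactly the ones from the question; the class name OriginalRegexTrace is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that lists what each group of the question's regex captured.
class OriginalRegexTrace {
    static final String REGEX =
            "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
            + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
            + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";
    static final String TEXT =
            "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";

    // Returns group(0)..group(groupCount()) of the first match; groups that did not participate are null.
    static String[] firstMatchGroups() {
        Matcher m = Pattern.compile(REGEX).matcher(TEXT);
        if (!m.find()) {
            throw new IllegalStateException("pattern did not match");
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups();
        for (int i = 1; i < groups.length; i++) {
            System.out.println("group " + i + " = " + groups[i]);
        }
    }
}
```

In the output, group 2 stops at 675-441-3144 and group 11 starts with the three semicolons, which is the split described in steps 2 and 3.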

The Solution

Do both of the following two things:

  • Convert the contents of group 2 from

     option1|option2 

    to

     option1(option2)?|option2 
  • Change $11 in your replacement pattern to $14. (Part 2 below brings this down to $12.)

This will greedily match one or both options, rather than only one option. The replacement pattern changes because the template adds three groups in front of the trailing one: the new wrapper group plus a second copy of option 2's two groups.

Part 2

The Problem

Now that we have modified the regex, our original "option 2" no longer makes sense. Given our new pattern template option1(option2)?|option2, it will be impossible for group 2 to match "675-441-3144;;;78888464#". This is because our original "option 1" will match all of "675-441-3144" and then stop. Our original "option 2" will then attempt to match ";;;78888464#", but will be unable to, because it begins with a mandatory capture group of 6-16 digits, (\\d{6,16}), whereas ";;;78888464#" begins with a semicolon.

The Solution

Convert the contents of our original "option 2" from

(\d{6,16})([;,\.]{1,3}\d{3,}#?)?

to

([;,\.]{1,3}\d{3,}#?)?
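The difference between the two forms of "option 2" can be checked in isolation against the leftover text ";;;78888464#". This is a sketch with a hypothetical class name (OptionTwoCheck); lookingAt() is used for the old form because it anchors at the start of the input, roughly like a sub-pattern continuing from the cursor.

```java
import java.util.regex.Pattern;

// Hypothetical check of the old and new "option 2" against the text left over after option 1.
class OptionTwoCheck {
    static final String OLD_OPTION_2 = "(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?";
    static final String NEW_OPTION_2 = "([;,\\.]{1,3}\\d{3,}#?)?";
    static final String REST = ";;;78888464#";

    // The old form needs 6-16 digits first, so it cannot match at a semicolon.
    static boolean oldMatchesAtCursor() {
        return Pattern.compile(OLD_OPTION_2).matcher(REST).lookingAt();
    }

    // The new form starts with the separator characters and can consume the whole remainder.
    static boolean newConsumesRest() {
        return Pattern.compile(NEW_OPTION_2).matcher(REST).matches();
    }

    public static void main(String[] args) {
        System.out.println("old option 2 matches at cursor: " + oldMatchesAtCursor()); // false
        System.out.println("new option 2 consumes the rest: " + newConsumesRest());    // true
    }
}
```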

Part 3

The Problem

We have one final problem to solve. Now that our original "option 2" consists only of a single group with the ? quantifier, it is possible for it to successfully match a zero-length substring. So our pattern template option1(newoption2)?|newoption2 could result in a zero-length match, which does not fulfill the intended purpose of matching phone numbers.
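The zero-length behaviour is easy to confirm by matching against the empty string. A minimal sketch, with a hypothetical class name (ZeroLengthCheck), comparing the optional group with the same sub-pattern written without the group and its ? quantifier:

```java
import java.util.regex.Pattern;

// Hypothetical check: can each form of "option 2" match an empty string?
class ZeroLengthCheck {

    // The whole pattern is one optional group, so matching nothing at all succeeds.
    static boolean optionalGroupMatchesEmpty() {
        return Pattern.matches("([;,.]{1,3}\\d{3,}#?)?", "");
    }

    // Without the group and its "?", at least one separator and three digits are required.
    static boolean mandatoryFormMatchesEmpty() {
        return Pattern.matches("[;,.]{1,3}\\d{3,}#?", "");
    }

    public static void main(String[] args) {
        System.out.println(optionalGroupMatchesEmpty()); // prints "true"
        System.out.println(mandatoryFormMatchesEmpty()); // prints "false"
    }
}
```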

The Solution

Do both of the following:

  • Convert the contents of our new "option 2" from

    ([;,.]{1,3}\\d{3,}#?)?

    to

    [;,.]{1,3}\\d{3,}#?

  • Change $12 in our replacement string to $10, since we have now removed one group in two locations.
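Since the trailing ([^<]*<) group is always the last group, its index equals Matcher.groupCount(), so the $ index needed at each stage can be read off rather than counted by hand. The sketch below assembles each stage from the pieces discussed in Parts 1 to 3; the class and constant names are hypothetical, and the assembly via string concatenation is only an illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper that reports the index of the trailing group at each stage.
class TrailingGroupIndex {
    static final String PREFIX = "(>[^<]*?)";
    static final String SUFFIX = "([^<]*<)";
    static final String OPTION_1 =
            "(\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
            + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
            + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})";
    static final String ORIGINAL_OPTION_2 = "(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?";
    static final String PART_2_OPTION_2 = "([;,\\.]{1,3}\\d{3,}#?)?";
    static final String PART_3_OPTION_2 = "[;,\\.]{1,3}\\d{3,}#?";

    // The regex as posted in the question: option1|option2 inside group 2.
    static String original() {
        return PREFIX + "(" + OPTION_1 + "|" + ORIGINAL_OPTION_2 + ")" + SUFFIX;
    }

    // The template option1(option2)?|option2 inside group 2.
    static String template(String option2) {
        return PREFIX + "(" + OPTION_1 + "(" + option2 + ")?|" + option2 + ")" + SUFFIX;
    }

    // groupCount() is a property of the pattern, so an empty input is enough.
    static int trailingGroup(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println("original:     $" + trailingGroup(original()));
        System.out.println("after part 1: $" + trailingGroup(template(ORIGINAL_OPTION_2)));
        System.out.println("after part 2: $" + trailingGroup(template(PART_2_OPTION_2)));
        System.out.println("after part 3: $" + trailingGroup(template(PART_3_OPTION_2)));
    }
}
```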


The Final Solution

Putting everything together, our final solution is as follows.

Search regex:

(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})([;,\.]{1,3}\d{3,}#?)?|[;,\.]{1,3}\d{3,}#?)([^<]*<)

Replacement regex:

$1<a href="tel:$2">$2</a>$10

Java:

final String searchRegex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
final String replacementRegex = "$1<a href=\"tel:$2\">$2</a>$10";

String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(searchRegex, replacementRegex);

Proof of correctness
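One way to check the final solution is to run it on the sample input and compare the output with the result the question asked for. The following is a minimal sketch, assuming the search regex and replacement string exactly as given above; the class name FinalSolutionCheck and the method name linkify are hypothetical.

```java
// Hypothetical check of the final search regex and replacement string against the sample input.
class FinalSolutionCheck {
    static final String SEARCH_REGEX =
            "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
            + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
            + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?"
            + "|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
    static final String REPLACEMENT = "$1<a href=\"tel:$2\">$2</a>$10";

    static String linkify(String html) {
        return html.replaceAll(SEARCH_REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
        String expected = "<span style=\"font-size:11.0pt\">"
                + "<a href=\"tel:675-441-3144;;;78888464#\">675-441-3144;;;78888464#</a>"
                + "<o:p></o:p></span>";
        String actual = linkify(text);
        System.out.println(actual);
        System.out.println("matches expected result: " + actual.equals(expected));
    }
}
```

This only exercises the single sample from the question; other phone number formats would need their own inputs.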

Well, after trying to do it with replaceAll without success, I had to implement the replacement method myself:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String parsePhoneNumbers(String html){
    // Each regex backslash is doubled so that it survives the Java string literal.
    StringBuilder regex = new StringBuilder(120);
    regex.append("(>[^<]*?)(")
       .append("((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?")
       .append("(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?")
       .append("((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})")
       .append("([;,\\.]{1,3}\\d{3,}#?)?)")
       .append(")+([^<]*<)");

    StringBuilder mutableHtml = new StringBuilder(html.length());
    Pattern pattern = Pattern.compile(regex.toString());
    Matcher matcher = pattern.matcher(html);
    int start = 0;

    while(matcher.find()){
        // Copy the text between the previous match and this one unchanged.
        mutableHtml.append(html.substring(start, matcher.start()));
        // Group 2 is a repeated group, so group(2) holds only its last repetition.
        // The trailing group is addressed as groupCount() instead of a fixed index.
        mutableHtml.append(matcher.group(1)).append("<a href=\"tel:")
                .append(matcher.group(2)).append("\">").append(matcher.group(2))
                .append("</a>").append(matcher.group(matcher.groupCount()));
        start = matcher.end();

    }
    // Copy whatever follows the last match.
    mutableHtml.append(html.substring(start));
    return mutableHtml.toString();
}
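For comparison, on Java 9 or later a manual loop of this kind can be written with Matcher.replaceAll(Function<MatchResult, String>). The sketch below is not the poster's code: it assumes the search regex from the final solution above rather than the regex in this method, and the class name FunctionalLinkifier is hypothetical. The function's result is still interpreted as a replacement string, so it is passed through Matcher.quoteReplacement to keep any "$" or "\" in the captured text literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the manual loop using the Java 9+ functional replaceAll.
class FunctionalLinkifier {
    // Assumption: the search regex from the final solution, compiled once.
    private static final Pattern PHONE = Pattern.compile(
            "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
            + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
            + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?"
            + "|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)");

    static String parsePhoneNumbers(String html) {
        return PHONE.matcher(html).replaceAll(match -> {
            String number = match.group(2);
            // The trailing group is the last one, whatever its index is.
            String link = match.group(1) + "<a href=\"tel:" + number + "\">" + number + "</a>"
                    + match.group(match.groupCount());
            // Quote the result so that "$" and "\" in it are not treated as references.
            return Matcher.quoteReplacement(link);
        });
    }

    public static void main(String[] args) {
        String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
        System.out.println(parsePhoneNumbers(text));
    }
}
```

Addressing the trailing group through groupCount() means the code does not need a two-digit index at all, which sidesteps the original question.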
