
Java String.replaceAll() with back reference

There is a Java regex question: given a string, if a "*" is at the start or the end of the string, keep it; otherwise, remove it. For example:

  1. * --> *
  2. ** --> **
  3. ******* --> **
  4. *abc**def* --> *abcdef*

The answer is:

str.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");

I tried the answer on my machine and it works. But I don't know how it works.

From my understanding, all matched substrings should be replaced with $1$2. However, it works as:

  1. (^\\*) replaced with $1,
  2. (\\*$) replaced with $2,
  3. \\* replaced with empty.

Could someone explain how it works? More specifically, if there is a | between expressions, how does String.replaceAll() work with back references?

Thank you in advance.

I'll try to explain what's happening in the regex.

str.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");

$1 represents the first group, which is (^\\*); $2 represents the second group, (\\*$).

When you call str.replaceAll, every * is matched, but only a * at the start or at the end is captured in a group. When replacing, each match is replaced with whatever was captured in the two groups for that match.

Example: *abc**def* --> *abcdef*

When the regex finds a * at the start of the string, it captures it in group 1 ($1). It then keeps looking, and when it finds a * at the end of the string, it captures it in group 2 ($2). On replacement, every * is eliminated except the ones stored in $1 or $2.

For more information see Capture Groups.

You can use lookarounds in your regex:

String repl = str.replaceAll("(?<!^)\\*+(?!$)", "");

RegEx Demo

RegEx Breakdown:

(?<!^)   # If previous position is not line start
\\*+     # match 1 or more *
(?!$)    # If next position is not line end
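To check that the lookaround version agrees with the original on the four examples from the question, a minimal sketch along these lines can help. The class name LookaroundCheck and the two helper method names are made up for illustration; only the two regexes come from this thread.

```java
public class LookaroundCheck {
    // Approach from the question: every '*' is matched, but a leading or
    // trailing '*' is put back through the back references $1 and $2.
    static String withGroups(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");
    }

    // Lookaround approach: remove runs of '*' that neither begin at the
    // start of the string nor finish at its end.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<!^)\\*+(?!$)", "");
    }

    public static void main(String[] args) {
        String[] inputs = {"*", "**", "*******", "*abc**def*"};
        for (String in : inputs) {
            System.out.println(in + " --> " + withGroups(in)
                    + " | " + withLookarounds(in));
        }
    }
}
```

Both columns should print the same value for each input; if they ever differ on your own data, that input is worth inspecting by hand.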

OP's regex is:

(^\*)|(\*$)|\*

It uses 2 capture groups, one for a * at the start and another for a * at the end, and uses back-references in the replacement. This works here, but it will be much slower for larger strings, as is evident from the number of steps taken in this demo: 209 vs 48 steps using lookarounds.

Another smaller improvement to OP's regex is to use a quantifier (note that a greedy \*+ can also swallow the final * when the string ends in a run of several *, so check it against your inputs):

(^\*)|(\*$)|\*+
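Before adopting the quantified variant, it may be worth comparing it with the original regex on a string that ends in a run of asterisks, since a greedy \*+ starting at the second character can consume the last * as well. The sketch below is only an illustration; the class name QuantifierCheck and its method names are assumptions, not part of the thread.

```java
public class QuantifierCheck {
    // Original regex from the question: one '*' per match.
    static String original(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");
    }

    // Variant with a quantifier on the last alternative.
    static String quantified(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*+", "$1$2");
    }

    public static void main(String[] args) {
        // The second input ends in a run of '*', which is where the two can differ.
        for (String in : new String[] {"*abc**def*", "*******"}) {
            System.out.println(in + " --> " + original(in)
                    + " vs " + quantified(in));
        }
    }
}
```

If the two columns differ for an input you care about, adding a (?!$) lookahead after \*+ is one way to stop the run before the final *.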

Well, let's first take a look at your regex (^\\*)|(\\*$)|\\* - it matches every *. If the * is at the start, it is captured into group 1; if it is at the end, it is captured into group 2. Every other * is matched, but not put into any group.

The replacement pattern $1$2 replaces every single match with the content of group 1 and group 2 - so for a * at the beginning or the end of the string, the content of one of the groups is that * itself, and it is therefore replaced by itself. For all the other matches, both groups are empty, so the matched * is replaced with an empty string.
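One way to see this per-match behavior is to walk the matches with a Matcher and print what each group holds. This is a minimal sketch; the class name GroupTrace and the method trace are hypothetical. Note that Matcher.group returns null for a group that did not take part in a match, and replaceAll substitutes nothing for such a group, which is what "empty" means above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupTrace {
    // Builds one line per match, showing where it starts and what each
    // capture group holds for that particular match.
    static String trace(String input) {
        Matcher m = Pattern.compile("(^\\*)|(\\*$)|\\*").matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append("match at ").append(m.start())
              .append(": $1=").append(m.group(1))   // null if group 1 did not match
              .append(", $2=").append(m.group(2))   // null if group 2 did not match
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(trace("*abc**def*"));
    }
}
```

Each call to find() produces a fresh match with its own group contents, so $1 and $2 are evaluated separately for every * in the input.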

Your problem was probably that $1$2 is not a literal replacement, but a back-reference to captured groups.

According to the Javadoc:

Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.
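The difference between a back-reference and a literal replacement can be seen by running the same regex twice, once with "$1$2" as is and once wrapped in Matcher.quoteReplacement. The class name QuoteReplacementDemo and its method names are assumptions made for this sketch.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    // "$1$2" is interpreted: each match is replaced by the group contents.
    static String asBackReference(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");
    }

    // quoteReplacement escapes the dollar signs, so each match is replaced
    // by the four literal characters $1$2.
    static String asLiteral(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*", Matcher.quoteReplacement("$1$2"));
    }

    public static void main(String[] args) {
        System.out.println(asBackReference("*ab*c*"));
        System.out.println(asLiteral("*ab*c*"));
    }
}
```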

Your regex: "(^\\*)|(\\*$)|\\*"

After removing quotes and String escapes: (^\*)|(\*$)|\*

There are three parts, separated by pipes |. The pipes mean OR, so whichever part matches, replaceAll() replaces the match with its second argument: $1$2. Essentially, the 1st part >> $1, the second >> $2, the third >> "". Note that "the 1st part" == $1, and so on, so in those cases the text is technically replaced by itself and does not change.

1 (^\*) is a capture group (the first). ^ anchors to the string start. \* matches *, but needs the escape \.

2 (\*$) again, a capture group (the 2nd one). The difference here is that it anchors to the end with $.

3 \* like before, matches a literal *.

The thing you need to understand about regexes is that an alternation always takes the first path that matches. While the *s at the beginning and end of the string could be matched by the 3rd part, they match the first or second part instead.
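The effect of alternative ordering can be checked by moving the bare \* to the front of the alternation. In this sketch (class and method names are made up for illustration), the reordered regex matches every * through the first alternative, so the groups never capture anything and all asterisks disappear.

```java
public class AlternationOrder {
    // Anchored alternatives come first, as in the question.
    static String anchoredFirst(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");
    }

    // Bare \* comes first: it always wins, so groups 1 and 2 stay unset.
    static String bareFirst(String s) {
        return s.replaceAll("\\*|(^\\*)|(\\*$)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(anchoredFirst("*abc**def*"));
        System.out.println(bareFirst("*abc**def*"));
    }
}
```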

Others have given very good answers so I won't repeat them. A suggestion when you are working to understand issues such as this is to temporarily add delimiters to the replacement string so that it is clear what is happening at each stage.

E.g. use "<$1|$2>". This will give results of <x|y> where x is $1 and y is $2.

String str = "*ab**c*d*";
// Strings are immutable, so keep the returned value
String result = str.replaceAll("(^\\*)|(\\*$)|\\*", "<$1|$2>");

The result is: <*|>ab<|><|>c<|>d<|*>

So for the first asterisk, $1 is * and $2 is empty because (^\\*) matches.

For mid-string asterisks, both $1 and $2 are empty because neither capturing group matches.

For the final asterisk, $1 is empty and $2 is * because (^\\*) does not match but (\\*$) does.
