
Java String.replaceAll() with back reference

There is a Java Regex question: Given a string, if the "*" is at the start or the end of the string, keep it, otherwise, remove it. For example:

  1. * --> *
  2. ** --> **
  3. ******* --> **
  4. *abc**def* --> *abcdef*

The answer is:

str.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");

I tried the answer on my machine and it works. But I don't know how it works.

From my understanding, all matched substrings should be replaced with $1$2 . However, it works as:

  1. (^\\*) replaced with $1,
  2. (\\*$) replaced with $2,
  3. \\* replaced with empty.

Could someone explain how it works? More specifically, if there is | between expressions, how String.replaceAll() works with back reference?

Thank you in advance.

I'll try to explain what's happening in regex.

str.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");

$1 refers to the first group, (^\\*), and $2 refers to the second group, (\\*$).

When you call str.replaceAll, the regex matches every *, one match at a time. Each match is then replaced with whatever the two groups captured for that particular match, which may be nothing at all.

Example: *abc**def* --> *abcdef*

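The regex first finds the * at the start of the string and captures it in group 1, so that match is replaced with itself. It then keeps scanning: each * in the middle is matched by the third alternative with nothing captured, so it is replaced with the empty string. Finally, the * at the end is captured in group 2 and is also replaced with itself. The net effect is that every * is removed except the ones captured in $1 or $2.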
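To see this match by match, here is a minimal sketch that walks the string with Matcher.find() and records what each group holds. The class and method names (GroupTrace, trace) are made up for illustration and are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupTrace {
    // Returns one line per match: where it starts and what $1 and $2 hold for it.
    static List<String> trace(String input) {
        Matcher m = Pattern.compile("(^\\*)|(\\*$)|\\*").matcher(input);
        List<String> lines = new ArrayList<>();
        while (m.find()) {
            // group(n) is null when that group did not take part in this match
            lines.add("at " + m.start() + ": $1=" + m.group(1) + " $2=" + m.group(2));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints:
        // at 0: $1=* $2=null
        // at 4: $1=null $2=null
        // at 5: $1=null $2=null
        // at 9: $1=null $2=*
        trace("*abc**def*").forEach(System.out::println);
    }
}
```

Each * is its own match, and the groups are reset for every match, which is why only the first and last asterisks survive the replacement.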

For more information see Capture Groups

You can use lookarounds in your regex:

String repl = str.replaceAll("(?<!^)\\*+(?!$)", "");

RegEx Demo

RegEx Breakup:

(?<!^)   # If previous position is not line start
\\*+     # match 1 or more *
(?!$)    # If next position is not line end
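As a quick check, this sketch runs the lookaround regex against the four examples from the question. The class and method names (LookaroundDemo, stripInner) are assumptions chosen for illustration.

```java
class LookaroundDemo {
    // Removes every run of '*' that is neither at the very start nor at the very end.
    static String stripInner(String str) {
        return str.replaceAll("(?<!^)\\*+(?!$)", "");
    }

    public static void main(String[] args) {
        String[] inputs = {"*", "**", "*******", "*abc**def*"};
        for (String in : inputs) {
            System.out.println(in + " --> " + stripInner(in));
        }
        // Prints:
        // * --> *
        // ** --> **
        // ******* --> **
        // *abc**def* --> *abcdef*
    }
}
```

Because the replacement is the empty string, no capturing groups or back-references are needed here.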

OP's regex is:

(^\*)|(\*$)|\*

It uses two capturing groups, one for * at the start and another for * at the end, and uses back-references in the replacement. That works here, but it will be much slower on larger strings, as is evident from the number of steps taken in this demo: 209 steps versus 48 with look-arounds.

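Another small improvement to OP's regex is to add a quantifier to the last alternative. Note that a bare \*+ there would also swallow a run of asterisks at the end of the string (for example, ******* would become * instead of **), so it needs an end-of-string guard:

(^\*)|(\*$)|\*+(?!$)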
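One way to check the effect of a quantifier on the last alternative is to run it on a string that ends in a run of asterisks. In this sketch the (?!$) guard on the second variant and the class and method names are assumptions added for illustration: a bare \*+ in the third alternative can consume the trailing run before the second alternative gets a chance to capture the final *.

```java
class QuantifierCheck {
    // Third alternative is a bare \*+ with no guard.
    static String bare(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*+", "$1$2");
    }

    // Third alternative is \*+(?!$), so it must stop before the last character.
    static String guarded(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*+(?!$)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(bare("*******"));       // prints *
        System.out.println(guarded("*******"));    // prints **
        System.out.println(bare("*abc**def*"));    // prints *abcdef*
        System.out.println(guarded("*abc**def*")); // prints *abcdef*
    }
}
```

Both variants agree whenever the string does not end in more than one *, so the difference only shows up on inputs like example 3 from the question.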

Well, let's first take a look at your regex (^\\*)|(\\*$)|\\* . It matches every * : if the * is at the start of the string it is captured into group 1, if it is at the end it is captured into group 2, and every other * is matched but not put into any group.

The replacement pattern $1$2 replaces every single match with the content of group 1 followed by the content of group 2. For a * at the beginning or the end of the string, one of the groups contains that * itself, so the * is replaced by itself. For all the other matches, neither group captured anything, so both references expand to the empty string and the matched * is removed.
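Roughly speaking, replaceAll with "$1$2" behaves like the hand-written loop below. This is only a sketch to make the per-match substitution visible, not how the JDK actually implements it, and the names ManualReplace and manualReplace are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualReplace {
    static String manualReplace(String input) {
        Matcher m = Pattern.compile("(^\\*)|(\\*$)|\\*").matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previous match
        while (m.find()) {
            out.append(input, last, m.start());             // copy text between matches
            if (m.group(1) != null) out.append(m.group(1)); // "$1": only set for a leading *
            if (m.group(2) != null) out.append(m.group(2)); // "$2": only set for a trailing *
            last = m.end();
        }
        out.append(input, last, input.length());            // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(manualReplace("*abc**def*")); // prints *abcdef*
    }
}
```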

Your confusion probably comes from the fact that $1$2 is not a literal replacement, but a backreference to the captured groups.

According to the Javadoc:

Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.
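To illustrate the difference the Javadoc describes, this small sketch applies the same regex twice, once with $1$2 treated as group references and once quoted with Matcher.quoteReplacement so it is inserted literally. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;

class LiteralReplacement {
    static final String REGEX = "(^\\*)|(\\*$)|\\*";

    // "$1$2" is interpreted as references to groups 1 and 2.
    static String asBackreference(String s) {
        return s.replaceAll(REGEX, "$1$2");
    }

    // quoteReplacement escapes the '$' signs, so the text "$1$2" is inserted as-is.
    static String asLiteral(String s) {
        return s.replaceAll(REGEX, Matcher.quoteReplacement("$1$2"));
    }

    public static void main(String[] args) {
        System.out.println(asBackreference("*a*b*")); // prints *ab*
        System.out.println(asLiteral("*a*b*"));       // prints $1$2a$1$2b$1$2
    }
}
```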

Your regex: "(^\\*)|(\\*$)|\\*"

After removing quotes and String escapes: (^\*)|(\*$)|\*

There are three parts, separated by pipes | . A pipe means OR, so a match can come from any one of the three parts, and replaceAll() replaces each match with its second argument: $1$2 . Essentially, a match of the 1st part becomes $1, a match of the 2nd part becomes $2, and a match of the 3rd part becomes "" . Note that $1 is exactly the text the 1st part matched (and likewise for $2), so those asterisks are replaced with themselves and end up unchanged.

1 (^\*) is a capture group (the first). ^ anchors to the string start. \* matches * , but needs the escape \ .

2 (\*$) again, a capture group (2nd one). Difference here is it anchors to the end with $

3 \* like before, matches a literal *

The thing you need to understand about regexes is that alternatives are tried from left to right, and the first one that matches wins. While * s at the beginning and end of the string could be matched by the 3rd part, they match the first or second parts instead.
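A small experiment makes the ordering visible: if the plain \* alternative is moved to the front, it wins every time and the capturing groups never take part in a match. The reordered regex and the names below are assumptions for illustration only.

```java
class AlternationOrder {
    // Original order: the anchored, capturing alternatives are tried first.
    static String original(String s) {
        return s.replaceAll("(^\\*)|(\\*$)|\\*", "$1$2");
    }

    // Reordered: the bare \* is tried first, so groups 1 and 2 are never set.
    static String reordered(String s) {
        return s.replaceAll("\\*|(^\\*)|(\\*$)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(original("*abc**def*"));  // prints *abcdef*
        System.out.println(reordered("*abc**def*")); // prints abcdef
    }
}
```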

Others have given very good answers so I won't repeat them. A suggestion when you are working to understand issues such as this is to temporarily add delimiters to the replacement string so that it is clear what is happening at each stage.

e.g. use "<$1|$2>". This gives results of the form <x|y>, where x is $1 and y is $2.

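String str = "*ab**c*d*";
// replaceAll returns a new String; the delimiters show what each match contributed
String result = str.replaceAll("(^\\*)|(\\*$)|\\*", "<$1|$2>");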

The result is: <*|>ab<|><|>c<|>d<|*>

So for the first asterisk, $1 = * and $2 is empty because (^\\*) matches.

For mid-string asterisks, both $1 and $2 are empty because neither capturing group matches.

For the final asterisk, $1 is empty and $2 is * because (^\\*) does not match but (\\*$) does.
