
String.replaceAll(regex) makes the same replacement twice

Can anyone tell me why

System.out.println("test".replaceAll(".*", "a"));

results in

aa

Note that the following has the same result:

System.out.println("test".replaceAll(".*$", "a"));

I have tested this on Java 6 and 7, and both seem to behave the same way. Am I missing something, or is this a bug in the Java regex engine?

This is not an anomaly: .* can match anything.

You ask to replace all occurrences:

  • the first occurrence matches the whole string, so the regex engine starts looking for the next match at the end of the input;
  • but .* also matches an empty string! It therefore matches an empty string at the end of the input and replaces it with "a" as well.
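To make the two matches visible, here is a minimal sketch that lists every match found by Matcher.find(). The class name EmptyMatchDemo and the helper matches are made up for this illustration; only the java.util.regex calls are real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // Hypothetical helper: returns each match as "[start,end) 'text'".
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add("[" + m.start() + "," + m.end() + ") '" + m.group() + "'");
        }
        return found;
    }

    public static void main(String[] args) {
        // Two matches: the whole string, then an empty match at the end.
        System.out.println(matches(".*", "test")); // [[0,4) 'test', [4,4) '']
        // Each match is replaced once, hence two replacements.
        System.out.println("test".replaceAll(".*", "a")); // aa
    }
}
```

The second entry has the same start and end index, which is the empty match at the end of the input.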

Using .+ instead will not exhibit this problem, since this regex cannot match an empty string (it requires at least one character to match).
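A short sketch comparing the two quantifiers (the class name PlusQuantifierDemo and the helper names are invented for this example). One consequence worth keeping in mind: because .+ needs at least one character, an empty input is left untouched, whereas .* still replaces it once.

```java
public class PlusQuantifierDemo {
    // Hypothetical helper: replace using .+ (at least one character required).
    static String withPlus(String input) {
        return input.replaceAll(".+", "a");
    }

    // Hypothetical helper: replace using .* (may match the empty string).
    static String withStar(String input) {
        return input.replaceAll(".*", "a");
    }

    public static void main(String[] args) {
        System.out.println(withPlus("test")); // a
        System.out.println(withStar("test")); // aa
        // On an empty input, .+ finds nothing while .* matches once.
        System.out.println("[" + withPlus("") + "]"); // []
        System.out.println("[" + withStar("") + "]"); // [a]
    }
}
```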

Or, use .replaceFirst() to only replace the first occurrence:

"test".replaceFirst(".*", "a")
       ^^^^^^^^^^^^

Now, why .* behaves the way it does and does not match more than twice (it theoretically could) is an interesting thing to consider. See below:

# Before first run
regex: |.*
input: |whatever
# After first run
regex: .*|
input: whatever|
#before second run
regex: |.*
input: whatever|
#after second run: since .* can match an empty string, it is satisfied...
regex: .*|
input: whatever|
# However, this means the regex engine matched an empty input.
# Java's regex engine (like many others), in this situation, will shift
# one character further in the input.
# So, before third run, the situation is:
regex: |.*
input: whatever<|ExhaustionOfInput>
# Nothing can ever match here: out
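The runs above can be reproduced with an explicit find/appendReplacement/appendTail loop. This is only a sketch of the documented way to do a manual replace-all with Matcher; it is not claimed to be the JDK's internal implementation of replaceAll. The class name ReplaceAllTrace and the method replaceAllTraced are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceAllTrace {
    // Hypothetical helper: replaces every match and logs each run.
    static String replaceAllTraced(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // StringBuffer is used because this overload also exists on Java 6 and 7.
        StringBuffer out = new StringBuffer();
        int run = 0;
        while (m.find()) {
            run++;
            System.out.println("run " + run + ": matched [" + m.start() + "," + m.end() + ")");
            m.appendReplacement(out, replacement);
        }
        // Copy whatever input is left after the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // run 1: matched [0,8)
        // run 2: matched [8,8)
        System.out.println(replaceAllTraced(".*", "whatever", "a")); // aa
    }
}
```

The third call to find() returns false, so the loop ends after two replacements.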

Note that, as @AH notes in the comments, not all regex engines behave this way. GNU sed, for instance, will consider that it has exhausted the input after the first match.

The accepted answer hasn't shown this yet, so here is an alternative way to fix your regex:

System.out.println("test".replaceAll("^.*$", "a"));

Note that I'm using both anchors: ^ and $. The $ isn't strictly necessary for this particular case, but I find adding both least cryptic.
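A small sketch of why the leading ^ is the anchor that matters here: without MULTILINE, ^ only matches at the start of the input, so the empty match at the end is rejected. The class name AnchoredReplaceDemo and the helper replace are invented for this example.

```java
public class AnchoredReplaceDemo {
    // Hypothetical helper: thin wrapper so the variants can be compared side by side.
    static String replace(String regex, String input) {
        return input.replaceAll(regex, "a");
    }

    public static void main(String[] args) {
        System.out.println(replace("^.*$", "test")); // a
        System.out.println(replace("^.*", "test"));  // a  (the ^ alone is enough)
        System.out.println(replace(".*$", "test"));  // aa (the $ alone does not help)
    }
}
```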
