Bug in .NET Regex.Replace?

The following code:

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var r = new Regex("(.*)");
        var c = "XYZ";
        var uc = r.Replace(c, "A $1 B");

        Console.WriteLine(uc);
    }
}

.NET Fiddle link

produces the following output:

A XYZ BA  B

Is this correct? Shouldn't the output be:

A XYZ B

I think I am doing something stupid here. I would appreciate any help in understanding this issue.


Here is something interesting:

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var r = new Regex("(.*)");
        var c = "XYZ";
        var uc = r.Replace(c, "$1");

        Console.WriteLine(uc);
    }
}

.NET Fiddle link

Output:

XYZ

As for why the engine returns 2 matches: it is due to the way .NET (and also Perl and Java) handles global matching, i.e. finding all matches of a given pattern in an input string.

The process can be described as follows (the current index is usually set to 0 at the beginning of a search, unless specified otherwise):

  1. From the current index, perform a search.
  2. If there is no match:
    1. If the current index already points at the end of the string (current index >= string.length), return the results so far.
    2. Otherwise, increment the current index by 1 and go to step 1.
  3. If the main match ( $0 ) is non-empty (at least one character is consumed), add the result and set the current index to the end of the main match ( $0 ). Then go to step 1.
  4. If the main match ( $0 ) is empty:
    1. If the previous match is non-empty, add the result and go to step 1.
    2. If the previous match is empty, backtrack and continue searching.
    3. If the backtracking attempt finds a non-empty match, add the result, set the current index to the end of the match and go to step 1.
    4. Otherwise, increment the current index by 1. Go to step 1.

The engine needs to check for empty matches; otherwise, it would end up in an infinite loop. The designers recognize that empty matches are useful (for splitting a string into characters, for example), so the engine must be designed to avoid getting stuck at one position forever.

This process explains why there is an empty match at the end: a search is conducted at the end of the string (index 3) after (.*) matches abc, and since (.*) can match an empty string, an empty match is found. The engine does not produce an infinite number of empty matches, since an empty match has already been found at the end and the search stops there.

 a b c
^ ^ ^ ^
0 1 2 3

First match:

 a b c
^     ^
0-----3

Second match:

 a b c
      ^
      3
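
To make the two matches from the diagrams concrete, here is a minimal C# sketch of a simplified search loop. It is an illustration under stated assumptions, not the engine's actual code: the helper name FindAll is hypothetical, and after an empty match it simply steps the index forward by 1 rather than retrying at the same position. On abc it reports the same two matches as the diagrams.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GlobalMatchSketch
{
    // Hypothetical helper: a simplified "find all matches" loop.
    // Assumption: after an empty match the index is bumped by 1
    // (no retry at the same position). Not the engine's real implementation.
    public static List<(int Index, int Length)> FindAll(Regex regex, string input)
    {
        var found = new List<(int Index, int Length)>();
        int pos = 0;
        while (pos <= input.Length)
        {
            Match m = regex.Match(input, pos);   // search from the current index
            if (!m.Success) break;               // nothing further to find
            found.Add((m.Index, m.Length));
            // Non-empty match: continue at its end.
            // Empty match: step past it so the loop cannot stay in place forever.
            pos = m.Length > 0 ? m.Index + m.Length : m.Index + 1;
        }
        return found;
    }

    public static void Main()
    {
        foreach (var (index, length) in FindAll(new Regex("(.*)"), "abc"))
            Console.WriteLine($"index={index}, length={length}");
        // index=0, length=3
        // index=3, length=0
    }
}
```

The second entry is the empty match at index 3; without the step past it, the loop would find that same empty match again and never terminate.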

With the global matching algorithm described in the steps above, at most 2 matches can start at the same index, and that can only happen when the first of them is an empty match.

Note that JavaScript simply increments the current index by 1 if the main match is empty, so there is at most 1 match per index. However, for (.*) in this case, global matching with the g flag produces the same result:

(The result below is from Firefox; note the g flag.)

> "XYZ".replace(/(.*)/g, "A $1 B")
"A XYZ BA  B"

I'll have to think about why this happens; I'm sure you're missing something. This fixes the problem, though: just anchor the regex.

var r = new Regex("^(.*)$");

Here's the .NET Fiddle demo.

Your regex has two matches, and Replace replaces both of them. The first is "XYZ" and the second is an empty string. What I'm not sure of is why it has two matches in the first place. You can fix it with ^(.*)$ to force it to consider the beginning and end of the string.

Or use + instead of * to force it to match at least one character.

.* matches an empty string because it allows zero characters.

.+ does not match an empty string because it requires at least one character.
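
The two suggested fixes can be compared side by side. The following is a small sketch; the class name ReplaceFixes and the helper Apply are made up for illustration, and only Regex.Replace itself comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceFixes
{
    // Hypothetical helper: applies the question's replacement with a given pattern.
    public static string Apply(string pattern, string input)
    {
        return new Regex(pattern).Replace(input, "A $1 B");
    }

    public static void Main()
    {
        const string input = "XYZ";

        // Original pattern: two matches ("XYZ" and the empty string at the end).
        Console.WriteLine(Apply("(.*)", input));    // A XYZ BA  B

        // Anchored: ^ cannot match again at index 3, so there is only one match.
        Console.WriteLine(Apply("^(.*)$", input));  // A XYZ B

        // + needs at least one character, so the empty match at the end is rejected.
        Console.WriteLine(Apply("(.+)", input));    // A XYZ B
    }
}
```

The two fixes differ on an empty input: ^(.*)$ still matches once and yields "A  B", while (.+) matches nothing and leaves the empty string unchanged.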

Interestingly, in JavaScript (in Chrome):

var r = /(.*)/;
var s = "XYZ";
console.log(s.replace(r, "A $1 B"));

This outputs the expected A XYZ B, without the spurious extra match.

Edit (thanks to @nhahtdh): adding the g flag to the JavaScript regex gives you the same result as in .NET:

var r = /(.*)/g;
var s = "XYZ";
console.log(s.replace(r, "A $1 B"));

The * quantifier matches 0 or more characters. This results in 2 matches: XYZ and nothing.

Try the + quantifier instead, which matches 1 or more.

A plain explanation is to look at the string like this: XYZ<nothing>

  1. We have the matches XYZ and <nothing>
  2. For each match:
    • Match 1: Replace XYZ with A $1 B ($1 is XYZ here). Result: A XYZ B
    • Match 2: Replace <nothing> with A $1 B ($1 is <nothing> here). Result: A  B

End result: A XYZ BA  B
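
The per-match replacement described above can be spelled out by hand. This is only a sketch of the idea, not how Regex.Replace is implemented internally; ManualReplace and ReplaceEach are hypothetical names, while Regex.Matches and Match.Result are the framework calls doing the work.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ManualReplace
{
    // Hypothetical helper that rebuilds the output one match at a time.
    public static string ReplaceEach(string input, string pattern, string replacement)
    {
        var sb = new StringBuilder();
        int last = 0;  // end of the previous match
        foreach (Match m in Regex.Matches(input, pattern))
        {
            sb.Append(input, last, m.Index - last);  // text between matches is kept as-is
            sb.Append(m.Result(replacement));        // expands $1 for this particular match
            last = m.Index + m.Length;
        }
        sb.Append(input, last, input.Length - last); // text after the last match
        return sb.ToString();
    }

    public static void Main()
    {
        // Match 1 contributes "A XYZ B", match 2 (empty) contributes "A  B".
        Console.WriteLine(ReplaceEach("XYZ", "(.*)", "A $1 B"));  // A XYZ BA  B
    }
}
```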

Why <nothing> is a match by itself is interesting, and something I haven't really thought much about. (Why aren't there infinitely many <nothing> matches?)

Regex is a peculiar language. You have to understand exactly what (.*) is going to match. You also need to understand greediness.

  • (.*) greedily matches 0 or more characters. So, in the string "XYZ", its first match consumes the entire string and places it in $1, giving you this:

    A XYZ B

    It then continues trying to match and matches the empty string at the end of the input, setting $1 to empty and giving you this:

    A  B

    Resulting in the string you are seeing:

    A XYZ BA  B

  • If you wanted to limit the greediness, you would use this expression:

    (.*?)

    This matches the empty string before each of the characters X, Y and Z, as well as the empty string at the end, and results in this:

    A  BXA  BYA  BZA  B
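
A quick way to check where the lazy pattern matches is to list the match positions. The sketch below uses a made-up class LazyDemo with a helper MatchIndexes; the point is only that every match is empty, so X, Y and Z are never consumed and stay in the output.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Hypothetical helper: start index of every match of pattern in input.
    public static int[] MatchIndexes(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Index).ToArray();
    }

    public static void Main()
    {
        // The lazy quantifier is satisfied with zero characters at every position.
        Console.WriteLine(string.Join(",", MatchIndexes("XYZ", "(.*?)")));  // 0,1,2,3

        // Four empty matches, each replaced by "A  B"; the letters are left in place.
        Console.WriteLine(Regex.Replace("XYZ", "(.*?)", "A $1 B"));  // A  BXA  BYA  BZA  B
    }
}
```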

If you do not want your regex to match beyond the bounds of your given string, anchor it with ^ and $.

To give you a better perspective of what is happening, consider this test and the resulting matching groups.

    [TestMethod()]
    public void TestMethod3()
    {
        var myText = "XYZ";
        var regex = new Regex("(.*)");
        var m = regex.Match(myText);
        var matchCount = 0;
        while (m.Success)
        {
            Console.WriteLine("Match" + (++matchCount));
            // The pattern has only group 1; Groups[2] yields an empty, unsuccessful group.
            for (int i = 1; i <= 2; i++)
            {
                Group g = m.Groups[i];
                Console.WriteLine("Group" + i + "='" + g + "'");
                CaptureCollection cc = g.Captures;
                for (int j = 0; j < cc.Count; j++)
                {
                    Capture c = cc[j];
                    Console.WriteLine("Capture" + j + "='" + c + "', Position=" + c.Index);
                }
            }
            m = m.NextMatch();
        }
    }

Output:

Match1
Group1='XYZ'
Capture0='XYZ', Position=0
Group2=''
Match2
Group1=''
Capture0='', Position=3
Group2=''

Notice that there are two matches. In the first, group 1 captured the entire string XYZ; in the second, group 1 is empty. Both still count as matches, so $1 was replaced with XYZ in the first case and with the empty string in the second.

Also note that the forward slash / is just another character to the .NET regex engine and has no special meaning. JavaScript treats / differently because it delimits regex literals.

Finally, to get what you actually want, consider this test:

    [TestMethod]
    public void TestMethod1()
    {
        var r = new Regex(@"^(.*)$");
        var c = "XYZ";
        var uc = r.Replace(c, "A $1 B");

        Assert.AreEqual("A XYZ B", uc);
    }
