Why isn't C# following my regex?

I have a C# application that reads a word file and looks for words wrapped in <angle brackets>.

It currently uses the following code and the regex shown.

 private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);

I've used several online testing tools and friends to validate that the regex works, and my application proves this (for those playing along at home: http://wordfiller.codeplex.com)!

My problem, however, is that the regex also picks up extra rubbish.

For example:

I'm walking on <sunshine>.

will return

sunshine>.

when it should just return

<sunshine>

Does anyone know why my application refuses to play by the rules?

I don't think the problem is your regex at all. It could be improved somewhat (you don't need the ([]) around each bracket), but that shouldn't affect the results. My strong suspicion is that the problem is in your C# implementation, not your regex.

Your regex should split <sunshine> into three separate groups: <, sunshine, and >. Having tested it with the code below, that's exactly what it does. My suspicion is that, somewhere in the C# code, you're appending Group 3 to Group 2 without realizing it. Some quick C# experimentation supports this:

private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
private string sunshine()
{
    string input = "I'm walking on <sunshine>.";
    var match = _regex.Match(input);
    var regex2 = new Regex("<[^>]*>", RegexOptions.Compiled); //A slightly simpler version

    string result = "";

    for (int i = 0; i < match.Groups.Count; i++)
    {
        result += string.Format("Group {0}: {1}\n", i, match.Groups[i].Value);
    }

    result += "\nWhat you're getting: " + match.Groups[2].Value + match.Groups[3].Value;
    result += "\nWhat you want: " + match.Groups[0].Value + " or " + match.Value;        
    result += "\nBut you don't need all those brackets and groups: " + regex2.Match(input).Value;

    return result;
}

Result:

Group 0: <sunshine>
Group 1: <
Group 2: sunshine
Group 3: >

What you're getting: sunshine>
What you want: <sunshine> or <sunshine> 
But you don't need all those brackets and groups: <sunshine> 

We will need to see more code to solve the problem. There is an off-by-one error somewhere in your code. It is impossible for that regular expression to return "sunshine>.", so the regular expression in question is not the problem. Without more details, I would assume that something is computing an index into the string containing your match, and that index is one character too far into the string.

If all you want is the text between < and >, then you'd be better off using:
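As a rough illustration of that off-by-one hypothesis, here is a minimal sketch. The class and method names are hypothetical, and the `Substring` call is only a guess at the kind of code that could be involved, not the asker's actual code. It shows that slicing the input one character past `match.Index` reproduces the reported output exactly:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OffByOneDemo
{
    // Same pattern as in the question.
    private static readonly Regex TagRegex =
        new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);

    // Let the Match object report the matched text.
    public static string Correct(string input)
    {
        return TagRegex.Match(input).Value;
    }

    // Hypothetical bug: re-slicing the input by hand, starting one character too late.
    // (This would throw if the match sat at the very end of the input.)
    public static string OffByOne(string input)
    {
        Match match = TagRegex.Match(input);
        return input.Substring(match.Index + 1, match.Length);
    }

    public static void Main()
    {
        string input = "I'm walking on <sunshine>.";
        Console.WriteLine(Correct(input));   // <sunshine>
        Console.WriteLine(OffByOne(input));  // sunshine>.
    }
}
```

If the real code does any manual index arithmetic like this, reading `match.Value` directly avoids the problem.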

 [<]([^>]*)[>] or simpler: <([^>]+)>

If you want to include < and >, then you could use:

 ([<][^>]*[>]) or simpler: (<[^>]+>)
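A short sketch of how the two simpler patterns might be used from C#. The class and method names here are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketPatterns
{
    // Text between the brackets: read capture group 1.
    public static string Inner(string input)
    {
        return Regex.Match(input, "<([^>]+)>").Groups[1].Value;
    }

    // Text including the brackets: the whole match, so no capture group is needed.
    public static string Outer(string input)
    {
        return Regex.Match(input, "<[^>]+>").Value;
    }

    public static void Main()
    {
        string input = "I'm walking on <sunshine>.";
        Console.WriteLine(Inner(input));  // sunshine
        Console.WriteLine(Outer(input));  // <sunshine>
    }
}
```

Note that + requires at least one character between the brackets, so an empty pair <> is not matched by either pattern.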

Your expression currently has three capture groups, indicated by the parentheses ().

In the case of <sunshine>, it currently returns the following:

Group 1: "<"

Group 2: "sunshine"

Group 3: ">"

So if you only looked at the second group, it should work!

The only explanation I can give for your observed behaviour is that, where you pull the matches out, you are concatenating Groups 2 and 3 but leaving out Group 1.

What you posted works perfectly fine.

        Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
        string test = "I'm walking on <sunshine>.";
        var match = _regex.Match(test);

The match is <sunshine>. I guess you need to provide more code.

Regex is eager by default. Teach it to be lazy!

What I mean is that the * operator matches as many repetitions as possible (it is said to be eager, or greedy). Use the *? operator instead; this tells the regex to match as few repetitions as possible (i.e. to be lazy):

<.*?>
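To make the eager/lazy difference concrete, here is a small sketch with hypothetical names and a made-up input containing two tags. It also includes the negated-class form from the question for comparison:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Eager: .* runs to the last '>' in the input.
    public static string Greedy(string input) => Regex.Match(input, "<.*>").Value;

    // Lazy: .*? stops at the first '>' it can.
    public static string Lazy(string input) => Regex.Match(input, "<.*?>").Value;

    // Negated class, as in the question: [^>]* cannot cross a '>' at all.
    public static string Negated(string input) => Regex.Match(input, "<[^>]*>").Value;

    public static void Main()
    {
        string input = "<walking> on <sunshine>";
        Console.WriteLine(Greedy(input));   // <walking> on <sunshine>
        Console.WriteLine(Lazy(input));     // <walking>
        Console.WriteLine(Negated(input));  // <walking>
    }
}
```

On this input the lazy pattern and the question's [^>]* pattern return the same match, since a negated character class already stops at the first >.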

Because you are using parentheses, you are creating capture groups. As a result, the match results contain the groups created by the regular expression as well as the overall match. You can reduce your regular expression to [<][^>]*[>] and it will match only the <text> that you want.
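A minimal sketch of that reduced pattern in use, assuming the goal is to collect every bracketed placeholder in a piece of text. `PlaceholderFinder` and its methods are hypothetical names, and the sample input is invented:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderFinder
{
    // Reduced pattern from the answer: no parentheses, so no extra capture groups.
    private static readonly Regex Placeholder =
        new Regex("[<][^>]*[>]", RegexOptions.Compiled);

    // Collects the full text of every match, brackets included.
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        foreach (Match match in Placeholder.Matches(input))
        {
            results.Add(match.Value);
        }
        return results;
    }

    // Without parentheses, Groups holds only group 0 (the whole match).
    public static int GroupCount(string input)
    {
        return Placeholder.Match(input).Groups.Count;
    }

    public static void Main()
    {
        string input = "Dear <name>, your order <id> has shipped.";
        Console.WriteLine(string.Join(", ", FindAll(input)));  // <name>, <id>
        Console.WriteLine(GroupCount(input));                  // 1
    }
}
```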
