Can someone explain this RegEx behaviour?

I'm seeing some strange behaviour when working with RegEx.

string dataString = "#Name #Location New York #Rating";
string[] rawValues = Regex.Split(dataString.Trim(), "(^|\\s)+#\\w+");

The pattern matches: "#Name", " #Location", " #Rating" (which is what I intend to match).
The split returns: ["", "", "", " ", " New York", " ", ""]

Question #1: The confusion starts already here. Why are there empty strings at positions 0, 1 and 2? Two for the matches and one because it was at the first position of the string?

But this was not the strange part.

string[] rawValues = Regex.Split(dataString.Trim(), "(\\s|^)+#(\\w*[A-Za-z_]+\\w*)");

The pattern matches: "#Name", " #Location", " #Rating" (the same as before).
But the split returns: ["", "", "Name", "", " ", "Location", " New York", " ", "Rating", ""]

Question #2: A pattern which leads to the exact same matches results in a totally different split output. How is this possible?

The reason is this sentence from MSDN:

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array.

You shouldn't use capturing groups in Split if you really just want to split the string at matches. You can avoid capturing groups by using (?:...) in place of every (...) you have.
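As a rough illustration of the difference, here is a minimal sketch that splits the string from the question with a capturing and a non-capturing version of the same delimiter. The class and method names are made up for this example, and the pattern is simplified (no + after the group).

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitCaptureDemo
{
    // Capturing group: the text captured by (^|\s) is added to the result.
    public static string[] SplitCapturing(string input)
    {
        return Regex.Split(input, @"(^|\s)#\w+");
    }

    // Non-capturing group: only the text between the matches is returned.
    public static string[] SplitNonCapturing(string input)
    {
        return Regex.Split(input, @"(?:^|\s)#\w+");
    }

    static void Main()
    {
        string dataString = "#Name #Location New York #Rating";

        // 7 elements: "", "", "", " ", " New York", " ", ""
        Console.WriteLine(string.Join("|", SplitCapturing(dataString)));

        // 4 elements: "", "", " New York", ""
        Console.WriteLine(string.Join("|", SplitNonCapturing(dataString)));
    }
}
```

The empty strings that remain in the second result come from the match at the very start, the two adjacent matches, and the match at the very end of the input.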

Plus, as you correctly assumed, the first and last "" originate from the fact that the string starts and ends with a match (so the empty strings before the first match and after the last one are reported in the split).
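To see where each element of the split output comes from, it can help to rebuild the result by hand. The sketch below is not the library's implementation; it is a hypothetical helper (SplitTrace.Trace) that mimics the documented behaviour by walking the matches and labelling each piece as either text between matches or the value of a capturing group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SplitTrace
{
    // Walks the matches left to right and records, for each one, the text
    // in front of it followed by the value of every group that took part.
    public static List<string> Trace(string input, string pattern)
    {
        var pieces = new List<string>();
        int previousEnd = 0;

        foreach (Match m in Regex.Matches(input, pattern))
        {
            pieces.Add("text:'" + input.Substring(previousEnd, m.Index - previousEnd) + "'");

            // Group 0 is the whole match, so start at group 1.
            for (int g = 1; g < m.Groups.Count; g++)
            {
                if (m.Groups[g].Success)
                    pieces.Add("group" + g + ":'" + m.Groups[g].Value + "'");
            }

            previousEnd = m.Index + m.Length;
        }

        // Whatever is left after the last match.
        pieces.Add("text:'" + input.Substring(previousEnd) + "'");
        return pieces;
    }

    static void Main()
    {
        string dataString = "#Name #Location New York #Rating";
        foreach (string piece in Trace(dataString, @"(^|\s)+#\w+"))
            Console.WriteLine(piece);
    }
}
```

For the first pattern from the question this lists seven pieces: the empty text before "#Name", the empty capture of ^, the empty text between the adjacent matches "#Name" and " #Location", and so on.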

Here is a regular expression that is better suited for your purposes:

@"(?:^|\s+)#\w*[A-Za-z_]+\w*"

Note that having the + outside of your first subpattern was also unnecessary, and it only muddies what ends up in the group. Firstly, it allows the group to capture multiple times, although Split only reports the last capture of each group. (The two additional "" in your first result are the empty capture of ^ and the empty text between the adjacent matches "#Name" and " #Location".) Secondly, there is no need to repeat ^ after the first space character has been matched, so it is enough to repeat only the space character. Also, there is no need to group the word after # at all.

However, if all you want is to match something like #name when it is at the start of the string or preceded by a space (i.e. not preceded by a non-space character), why include possible spaces in the match at all? A negative lookbehind gives you a nice way out:

@"(?<!\S)#\w*[A-Za-z_]+\w*"

This does exactly what is described above. The (?<!\S) matches if there is no non-space character immediately to its left (without including a space character in the match if there is one). That covers both cases without alternation, and you don't need to Trim your key names.
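A small sketch of how the lookbehind pattern behaves, both for collecting the keys and for splitting. The class and method names here are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // '#' not preceded by a non-space character, followed by a word that
    // contains at least one letter or underscore.
    private static readonly Regex KeyPattern =
        new Regex(@"(?<!\S)#\w*[A-Za-z_]+\w*");

    // The matched keys; no surrounding whitespace is part of the match.
    public static string[] Keys(string input)
    {
        return KeyPattern.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    // The text between the keys; the separating spaces stay in these pieces.
    public static string[] Split(string input)
    {
        return KeyPattern.Split(input);
    }

    static void Main()
    {
        string dataString = "#Name #Location New York #Rating";

        // #Name, #Location, #Rating
        Console.WriteLine(string.Join(", ", Keys(dataString)));

        // 4 pieces: "", " ", " New York ", ""
        Console.WriteLine(string.Join("|", Split(dataString)));

        // "a#b" and "#1" are skipped, only #x1 qualifies
        Console.WriteLine(string.Join(", ", Keys("a#b #1 #x1")));
    }
}
```

Since the spaces are no longer consumed by the delimiter, the values between the keys still carry them, so those would need a Trim (and the empty pieces a filter) if you use the split result.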

This happens because the regular expression you are splitting on matches the start of the string or one or more whitespace characters, followed by a hash ('#'), followed by one or more word characters.

Anything that matches it is treated as a delimiter and doesn't get included in the results (apart from the text of any capturing groups).

There are two ways you can do this:

  1. Split on what is not wanted and filter the results.
  2. Actively search for only what is wanted.

Here's some code with both of the above options:

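using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class Program
{
    static void Main( string[] args )
    {
        string   sourceText = "#Name #Location New York #Rating" ;

        // option 1: split on whitespace and then toss whatever isn't wanted
        string[] hashTokens1 = sourceText.Split().Where( x => x.StartsWith("#") ).ToArray() ;

        // option 2: actively search for what is desired
        string[] hashTokens2 = ParseSourceData( sourceText ).ToArray() ;

        // both arrays hold "#Name", "#Location" and "#Rating"
        Console.WriteLine( string.Join( ", " , hashTokens1 ) ) ;
        Console.WriteLine( string.Join( ", " , hashTokens2 ) ) ;
    }

    // a '#' followed by one or more word characters
    private static readonly Regex hashTokenPattern = new Regex( @"#\w+");

    // lazily yields every hash token found in s, in order of appearance
    private static IEnumerable<string> ParseSourceData( string s )
    {
        for ( Match m = hashTokenPattern.Match( s ) ; m.Success ; m = m.NextMatch() )
        {
            yield return m.Value ;
        }
    }
}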

Myself, I'd use the 2nd option, as it better states what you're trying to accomplish. A good general rule is to prefer positive assertions or tests over negative ones.

You can also write the 2nd option as a "one-liner", thus:

// option 2: actively search for what is desired
Regex hashTokenPattern = new Regex( @"#\w+");
string[] hashTokens2 = hashTokenPattern.Matches(sourceText).Cast<Match>().Select(x=>x.Value).ToArray();
