
Capture all groups that fit regex

I have a regex that does pretty much exactly what I want: \.?(\w+[\s|,]{1,}\w+[\s|,]{1,}\w+){1}\.?

Meaning it captures occurrences of 3 words in a row that are not separated by anything except spaces and commas (so parts of sentences only). However, I want it to match every instance of 3 consecutive words in a sentence.

So in this ultra simple example:

Hi this is Bob.

There should be 2 captures: "Hi this is" and "this is Bob". I can't seem to figure out how to get the regex engine to parse the entire statement this way. Any thoughts?

You cannot just get overlapping texts in capturing groups, but you can obtain overlapping matches with capturing groups holding the substrings you need.

Use

(?=\b(\w+(?:[\s,]+\w+){2})\b)

See the regex demo.

The unanchored positive lookahead is tried at every position in the string, and each successful match is an empty string. It does not consume characters, but it can still return submatches obtained with capturing groups.
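To make the difference concrete, here is a small sketch comparing a consuming version of the pattern with the lookahead version on the sample sentence. It assumes C# 9 or later (top-level statements) on a recent .NET; the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "Hi this is Bob.";

// Consuming pattern: the first match uses up "Hi this is",
// so the scan resumes after it and "this is Bob" is never found.
var consuming = Regex.Matches(input, @"\b\w+(?:[\s,]+\w+){2}\b")
                     .Cast<Match>()
                     .Select(m => m.Value)
                     .ToList();

// Lookahead pattern: every match is zero-length, so the scan moves on
// one position at a time and the captured texts are free to overlap.
var overlapping = Regex.Matches(input, @"(?=\b(\w+(?:[\s,]+\w+){2})\b)")
                       .Cast<Match>()
                       .Select(m => (m.Index, m.Length, Text: m.Groups[1].Value))
                       .ToList();

Console.WriteLine(string.Join(" | ", consuming));
foreach (var (index, length, text) in overlapping)
    Console.WriteLine($"index={index}, length={length}, group 1=\"{text}\"");

// Prints:
// Hi this is
// index=0, length=0, group 1="Hi this is"
// index=3, length=0, group 1="this is Bob"
```

Note that the match itself (m.Value) is always empty with the lookahead pattern; the useful text lives only in group 1.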

Regex breakdown:

  • \b - a word boundary
  • (\w+(?:[\s,]+\w+){2}) - 3 "words" separated by commas or whitespace:
    • \w+ - 1 or more word characters (letters, digits, or underscores), followed by
    • (?:[\s,]+\w+){2} - 2 sequences of 1 or more whitespace characters or commas followed by 1 or more word characters.

This pattern is just put into a capturing group (...) that is placed inside the lookahead (?=...).
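As a quick check of the breakdown, the sketch below runs the same pattern on two made-up inputs: one mixing commas and repeated spaces, and one containing a full stop. Triples is a hypothetical helper name, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var pattern = @"(?=\b(\w+(?:[\s,]+\w+){2})\b)";

// Hypothetical helper: returns group 1 of every match in the given text.
string[] Triples(string text) =>
    Regex.Matches(text, pattern)
         .Cast<Match>()
         .Select(m => m.Groups[1].Value)
         .ToArray();

// Commas and runs of whitespace both count as separators,
// and the separators are kept in the captured text.
var mixed = Triples("red, green,blue  yellow");

// A full stop is not in [\s,], so no triple spans the sentence break.
var twoSentences = Triples("One two. Three four five");

Console.WriteLine(string.Join(" | ", mixed));        // red, green,blue | green,blue  yellow
Console.WriteLine(string.Join(" | ", twoSentences)); // Three four five
```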

Word boundaries are important in this expression because \b prevents matching inside a word (between two word characters). As the lookahead is not anchored, it tests all positions inside the input string, and \b restricts where a match can be returned.
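The effect of \b is easy to see by running the pattern with and without it. This is a minimal sketch assuming C# 9 or later (top-level statements); CaptureAll is a hypothetical helper name.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "Hi this is Bob.";

// Hypothetical helper: returns group 1 of every match of the pattern.
string[] CaptureAll(string pattern) =>
    Regex.Matches(input, pattern)
         .Cast<Match>()
         .Select(m => m.Groups[1].Value)
         .ToArray();

var withBoundary = CaptureAll(@"(?=\b(\w+(?:[\s,]+\w+){2})\b)");
var withoutBoundary = CaptureAll(@"(?=(\w+(?:[\s,]+\w+){2}))");

Console.WriteLine(string.Join(" | ", withBoundary));
// Hi this is | this is Bob

Console.WriteLine(string.Join(" | ", withoutBoundary));
// Hi this is | i this is | this is Bob | his is Bob | is is Bob | s is Bob
```

Without the boundary, the lookahead also succeeds at positions in the middle of "Hi" and "this", which produces the truncated captures.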

In C#, you just need to collect all the match.Groups[1].Value values, e.g. like this:

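using System.Linq;
using System.Text.RegularExpressions;

var s = "Hi this is Bob.";
// Each match is zero-length; the three-word text is in capture group 1.
var results = Regex.Matches(s, @"(?=\b(\w+(?:[\s,]+\w+){2})\b)")
                        .Cast<Match>()
                        .Select(p => p.Groups[1].Value)
                        .ToList();
// results: "Hi this is", "this is Bob"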

See the IDEONE demo.
