Capture all groups that fit a regex

Question

I have a regex that does pretty much exactly what I want: \.?(\w+[\s|,]{1,}\w+[\s|,]{1,}\w+){1}\.?

Meaning it captures occurrences of 3 words in a row that are not separated by anything except spaces and commas (so parts of sentences only). However, I want it to match every instance of 3 consecutive words in a sentence, including overlapping ones.

So in this ultra simple example:

Hi this is Bob.

There should be 2 captures - "Hi this is" and "this is Bob". I can't seem to figure out how to get the regex engine to parse the entire statement this way. Any thoughts?

Answer 1

You cannot get overlapping texts directly as ordinary matches, but you can obtain overlapping matches whose capturing groups hold the substrings you need.

Use

(?=\b(\w+(?:[\s,]+\w+){2})\b)

See the regex demo

The unanchored positive lookahead is tested at every position of the string, and each successful test yields an empty match. It does not consume characters, but it can still return submatches obtained with capturing groups.
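To make the difference concrete, here is a minimal sketch comparing a consuming version of the pattern with the lookahead version on the question's sample sentence. The class and method names (OverlapDemo, Consuming, Overlapping) are invented for this illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Consuming version: each match eats its characters, so matches cannot overlap.
    public static List<string> Consuming(string input) =>
        Regex.Matches(input, @"\b\w+(?:[\s,]+\w+){2}\b")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    // Lookahead version: every match is empty; the text is read from group 1.
    public static List<string> Overlapping(string input) =>
        Regex.Matches(input, @"(?=\b(\w+(?:[\s,]+\w+){2})\b)")
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToList();

    public static void Main()
    {
        var s = "Hi this is Bob.";
        Console.WriteLine(string.Join(" | ", Consuming(s)));   // Hi this is
        Console.WriteLine(string.Join(" | ", Overlapping(s))); // Hi this is | this is Bob
    }
}
```

The consuming pattern stops after "Hi this is" because only " Bob." is left to scan, while the lookahead is retried at each following position and so also finds "this is Bob".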

Regex breakdown:

\b - a word boundary
(\w+(?:[\s,]+\w+){2}) - 3 "words" separated by commas or whitespace:
- \w+ - 1 or more word characters (letters, digits or underscores), followed by
- (?:[\s,]+\w+){2} - 2 sequences of 1 or more whitespace characters or commas followed by 1 or more word characters.

This pattern is simply put into a capturing group (...) that is placed inside the lookahead (?=...).
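The breakdown can also be written into the pattern itself. The sketch below assumes RegexOptions.IgnorePatternWhitespace, which lets a .NET pattern span several lines with # comments; the class name, the Find helper and the second sample string are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class VerbosePatternDemo
{
    // The answer's pattern, spread over several lines and annotated.
    public const string Pattern = @"
        (?=                  # lookahead: assert without consuming characters
          \b                 # start at a word boundary
          (                  # group 1: the three-word run
            \w+              #   first word
            (?:[\s,]+\w+){2} #   separators followed by a word, twice
          )
          \b                 # end at a word boundary
        )";

    // Returns the contents of group 1 for every (empty) match.
    public static string[] Find(string input) =>
        Regex.Matches(input, Pattern, RegexOptions.IgnorePatternWhitespace)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();

    public static void Main()
    {
        foreach (var run in Find("red, green,blue yellow"))
            Console.WriteLine(run);
        // red, green,blue
        // green,blue yellow
    }
}
```

Note that the separators are kept inside the captured text, so "red, green,blue" is returned with its commas and spaces intact.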

Word boundaries are important in this expression because \b prevents a match from starting inside a word (between two word characters). As the lookahead is not anchored, it is tested at every position in the input string, and \b restricts the positions where a match can be returned.

In C#, you just need to collect all the match.Groups[1].Value values, e.g. like this:
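As a rough illustration of that restriction, the sketch below runs the same lookahead with and without \b on the sample sentence. BoundaryDemo and Capture are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    public const string WithBoundaries = @"(?=\b(\w+(?:[\s,]+\w+){2})\b)";
    public const string WithoutBoundaries = @"(?=(\w+(?:[\s,]+\w+){2}))";

    // Collects group 1 from every match of the given lookahead pattern.
    public static List<string> Capture(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToList();

    public static void Main()
    {
        var s = "Hi this is Bob.";
        // Without \b the lookahead also succeeds in the middle of words.
        Console.WriteLine(string.Join(" | ", Capture(s, WithoutBoundaries)));
        // Hi this is | i this is | this is Bob | his is Bob | is is Bob | s is Bob
        Console.WriteLine(string.Join(" | ", Capture(s, WithBoundaries)));
        // Hi this is | this is Bob
    }
}
```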

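using System.Linq;                     // Cast<T>(), Select(), ToList()
using System.Text.RegularExpressions;  // Regex, Match

var s = "Hi this is Bob.";
// Each match is empty; the three-word run is held in capturing group 1.
var results = Regex.Matches(s, @"(?=\b(\w+(?:[\s,]+\w+){2})\b)")
                        .Cast<Match>()
                        .Select(p => p.Groups[1].Value)
                        .ToList();
// results: "Hi this is", "this is Bob"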

See the IDEONE demo

Solution 1
2, accepted, 2015-11-17 13:25:57
