Regular expression to match a capture group multiple times within a subsection of a text document

Question

I am passing an XML document, as a text document, through a regular expression process.

<YaddaYaddaPrecedingMarkup>includes (a) and (b) and (c) and (d) and ...

<MyElement>SECTIONBEGINS (a) Item A (b) Item B (c) Item C (d) Item D</MyElement>

<YaddaYaddaFollowingMarkup>includes (a) and (b) and (c) and (d) and ...

I want my regular expression to capture the bullet labels '(a)', '(b)', '(c)', '(d)' (etc.) that appear within 'MyElement', whose text begins with "SECTIONBEGINS".

I need this regular expression to ignore any other instances of (a) ... (b) ... (c) appearing elsewhere within my XML-as-text.

If I use:

(\([a-z]\))

I match (a), (b), (c) throughout the document. That expression is too unrestricted.

If I use:

>SECTIONBEGINS(?:.*?)(\([a-z]\))(?:.*)<

I successfully match only within the correct section, but I match only '(a)' (the first hit) and not the (b), (c), (d) of that same section.

I've tried many other variations, some of which select the '(d)' instead, but none seems to capture more than one hit.

Answer 1

Variant 1: Lookbehind

(?<=SECTIONBEGINS[^>]*)\([a-z]\)
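
A minimal sketch of how Variant 1 might be used from C#, assuming the .NET regex engine (it allows an unbounded lookbehind such as [^>]*, which many other engines reject). The sample string and the variable names are made up for illustration, modelled on the text in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input modelled on the question (the exact text is an assumption).
string xml =
    "<YaddaYaddaPrecedingMarkup>includes (a) and (b) and (c) and (d) and ...\n" +
    "<MyElement>SECTIONBEGINS (a) Item A (b) Item B (c) Item C (d) Item D</MyElement>\n" +
    "<YaddaYaddaFollowingMarkup>includes (a) and (b) and (c) and (d) and ...";

// The lookbehind requires "SECTIONBEGINS" somewhere to the left of the label
// with no '>' in between, so labels after </MyElement> are rejected.
var lookbehind = new Regex(@"(?<=SECTIONBEGINS[^>]*)\([a-z]\)");

// Regex.Matches returns every match, not only the first one.
var labels = lookbehind.Matches(xml)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

Console.WriteLine(string.Join(" ", labels)); // prints: (a) (b) (c) (d)
```

Because each label is a separate match here, no capturing group is needed; the whole match value is the label.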

Variant 2: \G anchor + capturing group

(?:SECTIONBEGINS|\G)[^<(]*(\([a-z]\))
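
A minimal sketch of Variant 2 in C#, again assuming the .NET regex engine and a made-up sample string modelled on the question. Here the label sits in capturing group 1, because each match also consumes the text leading up to the label.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input modelled on the question (the exact text is an assumption).
string doc =
    "<YaddaYaddaPrecedingMarkup>includes (a) and (b) and (c) and (d) and ...\n" +
    "<MyElement>SECTIONBEGINS (a) Item A (b) Item B (c) Item C (d) Item D</MyElement>\n" +
    "<YaddaYaddaFollowingMarkup>includes (a) and (b) and (c) and (d) and ...";

// Each match starts either at "SECTIONBEGINS" or exactly where the previous
// match ended (\G), then skips text containing no '<' or '(' up to the next label.
var chained = new Regex(@"(?:SECTIONBEGINS|\G)[^<(]*(\([a-z]\))");

// Group 1 holds the label; the full match also includes the skipped text.
var sectionLabels = chained.Matches(doc)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

Console.WriteLine(string.Join(" ", sectionLabels)); // prints: (a) (b) (c) (d)
```

The chain breaks at the first '<' after the last label, which is what keeps the later (a), (b), ... out. One caveat: \G also matches at the very start of the input on the first attempt, so a document that begins with plain text containing a label before any '<' could produce a stray match; writing (?!\A)\G instead of \G is a common guard against that.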

Answer 2

You need to look at the Group.Captures collection of the match. In .NET, a group that repeats inside a single match keeps every capture, not just the last one:

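// Requires: using System.Linq; using System.Text.RegularExpressions;
var items = Regex.Match(xml, @">SECTIONBEGINS (?<items>\([a-z]\) .+?)+<")
    .Groups["items"].Captures.Cast<Capture>()
    .Select(x => x.Value)   // each capture holds a label plus its item text
    .ToList();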

Or, if you would like to group them into key/value pairs:

var match = Regex.Match(xml, @">SECTIONBEGINS( (\((?<index>[a-z])\) (?<item>.+?)))+<");
// Pair the n-th label capture with the n-th item capture.
var itemsByLabel = Enumerable.Zip(
    match.Groups["index"].Captures.Cast<Capture>(),
    match.Groups["item"].Captures.Cast<Capture>(),
    Tuple.Create)
    .ToDictionary(x => x.Item1.Value, x => x.Item2.Value);

EDIT: If you don't care about the bullet labels, you can extract the items with:

Regex.Match(xml, @">SECTIONBEGINS( (\((?<index>[a-z])\) (?<item>.+?)))+<")
    .Groups["item"].Captures.Cast<Capture>()
    .Select(x => x.Value)

Or, if you want to replace the content in place:

Regex.Replace(xml, @">SECTIONBEGINS( (\((?<index>[a-z])\) (?<item>.+?)))+<",
    m => string.Format(">SECTIONBEGINS {0}<", string.Join(" ", m.Groups["item"]
        .Captures.Cast<Capture>()
        .Select((x,i) => string.Format("({0}) {1}",
            (char)(((int)'a')+i),
            x.Value.ToUpper() // TODO: your replace logic here
    ))))
)

Answer 1 (accepted): score 2, 2017-09-11 19:34:49

Answer 2: score 1, 2017-09-11 18:54:13