
Regex to match capture group multiple times within subsection of text document

I am passing an XML document, as a text document, through a regular expression process.

<YaddaYaddaPrecedingMarkup>includes (a) and (b) and (c) and (d) and ...

<MyElement>SECTIONBEGINS (a) Item A (b) Item B (c) Item C (d) Item D</MyElement>

<YaddaYaddaFollowingMarkup>includes (a) and (b) and (c) and (d) and ...

I want my regular expression to capture the bullet labels '(a)', '(b)', '(c)', '(d)' (etc.) that appear within 'MyElement', whose text begins with "SECTIONBEGINS".

I need this regular expression to ignore any other instances of (a) ... (b) ... (c) appearing elsewhere within my XML-as-text.

If I use:

(\([a-z]\))

I match (a), (b), (c) throughout the document. That expression is too unrestricted.

If I use:

>SECTIONBEGINS(?:.*?)(\([a-z]\))(?:.*)<

I match only within the correct section, but I get only '(a)' (the first hit) and not the '(b)', '(c)', '(d)' of that same section.

I've tried many other variations, some of which select '(d)' instead, but none captures more than one hit.

Variant 1: Lookbehind

(?<=SECTIONBEGINS[^>]*)\([a-z]\)
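A minimal sketch of how Variant 1 might be used from C#. It relies on .NET allowing variable-length lookbehind; the sample string and the class name `LookbehindDemo` are assumptions made for illustration, modeled on the question's document.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class LookbehindDemo
{
    static void Main()
    {
        // Hypothetical sample input modeled on the question's document.
        string xml =
            "<YaddaYaddaPrecedingMarkup>includes (a) and (b)\n" +
            "<MyElement>SECTIONBEGINS (a) Item A (b) Item B (c) Item C (d) Item D</MyElement>\n" +
            "<YaddaYaddaFollowingMarkup>includes (a) and (b)";

        // The lookbehind demands SECTIONBEGINS earlier in the text with no '>'
        // between it and the label, so labels in other elements are rejected.
        var labels = Regex.Matches(xml, @"(?<=SECTIONBEGINS[^>]*)\([a-z]\)")
            .Cast<Match>()
            .Select(m => m.Value);

        // Each match is one bullet label from inside MyElement.
        Console.WriteLine(string.Join(" ", labels)); // prints "(a) (b) (c) (d)"
    }
}
```

Because every label is its own match, no capture group is needed here; `Match.Value` is already the label.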

Variant 2: \G anchor + capturing group

(?:SECTIONBEGINS|\G)[^<(]*(\([a-z]\))
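A minimal sketch of Variant 2 in C#, assuming the same kind of input as the question. Each match either starts at SECTIONBEGINS or continues from where the previous match ended (`\G`), and the label is read from group 1. The sample string and the class name `ContinuationDemo` are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

class ContinuationDemo
{
    static void Main()
    {
        // Hypothetical sample input modeled on the question's document.
        string xml =
            "<YaddaYaddaPrecedingMarkup>includes (a) and (b)\n" +
            "<MyElement>SECTIONBEGINS (a) Item A (b) Item B (c) Item C (d) Item D</MyElement>\n" +
            "<YaddaYaddaFollowingMarkup>includes (a) and (b)";

        // [^<(]* skips item text but stops at '<', so the chain of matches
        // ends at the closing tag and cannot leak into the following markup.
        string pattern = @"(?:SECTIONBEGINS|\G)[^<(]*(\([a-z]\))";

        foreach (Match m in Regex.Matches(xml, pattern))
        {
            Console.WriteLine(m.Groups[1].Value); // the label, e.g. "(a)"
        }
    }
}
```

One caveat worth checking against real input: on the first attempt `\G` also matches at the very start of the string, so a document that opens with plain text containing a label could produce a stray match. Writing `(?!\A)\G` instead of `\G` is one way to guard against that.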

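You need to look into the Captures collection of the matched group (Match.Groups[...].Captures), which keeps every value a repeated group captured, not just the last one: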

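// requires: using System.Linq; using System.Text.RegularExpressions;
var items = Regex.Match(xml, @">SECTIONBEGINS (?<items>\([a-z]\) .+?)+<")
    .Groups["items"].Captures.Cast<Capture>()
    .Select(x => x.Value)   // each value is a label plus its item text, e.g. "(a) Item A "
    .ToList();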

Or, if you like to group them into key/value pair:

var match = Regex.Match(xml, @">SECTIONBEGINS( (\((?<index>[a-z])\) (?<item>.+?)))+<");
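// Pair each captured label with its item text: "a" -> "Item A", "b" -> "Item B", ...
var itemsByLabel = Enumerable.Zip(
    match.Groups["index"].Captures.Cast<Capture>(),
    match.Groups["item"].Captures.Cast<Capture>(),
    Tuple.Create)
    .ToDictionary(x => x.Item1.Value, x => x.Item2.Value);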
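Since the question asks for the bullet labels themselves, here is a minimal self-contained sketch that reads them from the `index` group of the same pattern. The sample string, the class name `CapturesDemo`, and the choice to re-wrap each letter in parentheses are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class CapturesDemo
{
    static void Main()
    {
        // Hypothetical sample input modeled on the question's document.
        string xml =
            "<YaddaYaddaPrecedingMarkup>includes (a) and (b)\n" +
            "<MyElement>SECTIONBEGINS (a) Item A (b) Item B (c) Item C (d) Item D</MyElement>\n" +
            "<YaddaYaddaFollowingMarkup>includes (a) and (b)";

        var match = Regex.Match(xml, @">SECTIONBEGINS( (\((?<index>[a-z])\) (?<item>.+?)))+<");
        if (!match.Success)
        {
            Console.WriteLine("no SECTIONBEGINS element found");
            return;
        }

        // One Capture per repetition of the group; "index" holds only the letter,
        // so the parentheses are added back here.
        var labels = match.Groups["index"].Captures.Cast<Capture>()
            .Select(c => "(" + c.Value + ")");

        Console.WriteLine(string.Join(" ", labels)); // prints "(a) (b) (c) (d)"
    }
}
```

This is a single match with several captures, in contrast to the two variants earlier in the thread, which return one match per label.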

EDIT: If you don't care about the bullet labels, you can extract the items through:

var items = Regex.Match(xml, @">SECTIONBEGINS( (\((?<index>[a-z])\) (?<item>.+?)))+<")
    .Groups["item"].Captures.Cast<Capture>()
    .Select(x => x.Value);  // item text only, e.g. "Item A"

Or, if you want to replace the content in place:

var result = Regex.Replace(xml, @">SECTIONBEGINS( (\((?<index>[a-z])\) (?<item>.+?)))+<",
    m => string.Format(">SECTIONBEGINS {0}<", string.Join(" ", m.Groups["item"]
        .Captures.Cast<Capture>()
        .Select((x, i) => string.Format("({0}) {1}",
            (char)('a' + i),   // re-label sequentially: (a), (b), (c), ...
            x.Value.ToUpper()  // TODO: your replace logic here
        )))));
