
Regex only matching one group

I have a very old (and strangely delimited) string that represents a table and I want to get all text in between two "tags" (they are an abomination... here they are in all of their glory):

<<<NAME=Test User>>>
<<<DATE=11/06/2014>>>
|||COMMENTS_FOLLOW_UP=\\myserver\Reporter\testu\20140611.rtf|||
|||COMMENTS_APPOINTMENT_LIST=\\myserver\Reporter\testu\COMMENTS_APPOINTMENT_LIST_20140611.rtf|||
~~~ START MONTHLY BREAKDOWN ~~~
### ROW START ###
<<<ACTIVITY=Target Group Support>>>
<<<PERCENTAGE_OF_TIME_TAKEN_FOR_THE_MONTH=25%>>>
### ROW END ###
### ROW START ###
<<<ACTIVITY=Non-target Group Support>>>
<<<PERCENTAGE_OF_TIME_TAKEN_FOR_THE_MONTH=25%>>>
### ROW END ###
### ROW START ###
<<<ACTIVITY=Networking/Guest Speaking Activities>>>
<<<PERCENTAGE_OF_TIME_TAKEN_FOR_THE_MONTH=25%>>>
### ROW END ###
### ROW START ###
<<<ACTIVITY=Processing initial calls, making appointments, completing reports and other tasks>>>
<<<PERCENTAGE_OF_TIME_TAKEN_FOR_THE_MONTH=25%>>>
### ROW END ###
### ROW START ###
<<<ACTIVITY=Total>>>
<<<PERCENTAGE_OF_TIME_TAKEN_FOR_THE_MONTH=100%>>>
### ROW END ###
~~~ END MONTHLY BREAKDOWN ~~~
~~~ START EVENTS ~~~
### ROW START ###
<<<DATE=11/06/2014 12:00:00 AM>>>
<<<EVENT_NAME=Test's Event>>>
<<<NAME_OF_ORGANISATION/GROUP=Tests Org>>>
<<<PARTICIPANT_GROUP=Test>>>
<<<NUMBER_OF_PARTICIPANTS=50>>>
### ROW END ###
~~~ END EVENTS ~~~ 

So I need to get the text between the delimiters ~~~ START XXX ~~~ and ~~~ END XXX ~~~.

So here's the pattern I whipped up: ~~~ START .+~~~(.*)~~~ END .+~~~

As you can see, a master of the Regex-Fu, I am not.

NOTE: I am using the SingleLine flag.

The Problem: This matches the correct text but only returns one group, that of the body text of the first table tag. How do I get the C# regex-a-tron 9000 to also return the body text from the second tag in a second match group?

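You can use Regex.Matches:

var matches = Regex.Matches(input_string, regex);
// Declare the loop variable as Match: on older frameworks MatchCollection
// only enumerates as object, so "var" would not give you a Match.
foreach (Match m in matches)
{
    // do whatever
}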

Or, you can get a match, then get the next match, etc:

var m = Regex.Match(input_string, regex);
while (m.Success)
{
    // do something with this match
    // then get the next match
    m = m.NextMatch();
}
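As a rough sketch of how the loop behaves, here is a small program that counts the matches for the asker's greedy pattern and for a lazy variant of it. The shortened input, the lazy pattern, and the names MatchLoopDemo and CountMatches are assumptions made for illustration; they are not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchLoopDemo
{
    // Shortened stand-in for the question's data (an assumption for illustration).
    public const string Input =
        "~~~ START MONTHLY BREAKDOWN ~~~\n" +
        "<<<ACTIVITY=Total>>>\n" +
        "~~~ END MONTHLY BREAKDOWN ~~~\n" +
        "~~~ START EVENTS ~~~\n" +
        "<<<EVENT_NAME=Test's Event>>>\n" +
        "~~~ END EVENTS ~~~";

    // The asker's pattern (greedy) and a lazy variant of it.
    public const string Greedy = "~~~ START .+~~~(.*)~~~ END .+~~~";
    public const string Lazy = "~~~ START .+?~~~(.*?)~~~ END .+?~~~";

    // Counts the matches of a pattern in Input with the Singleline flag set.
    public static int CountMatches(string pattern)
    {
        return Regex.Matches(Input, pattern, RegexOptions.Singleline).Count;
    }

    static void Main()
    {
        Console.WriteLine(CountMatches(Greedy)); // prints 1
        Console.WriteLine(CountMatches(Lazy));   // prints 2

        // One iteration per block; Group 1 is the text between the tags.
        foreach (Match m in Regex.Matches(Input, Lazy, RegexOptions.Singleline))
        {
            Console.WriteLine("Body: " + m.Groups[1].Value.Trim());
        }
    }
}
```

With the greedy pattern the loop body runs only once, because the single match already swallows both blocks. Looping helps only once each block produces its own match.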

First off, I recommend you change your regex to this:

(?s)~~~ START ([^~]*).*?END \1 ~~~
  1. After the opening tildes and START, the ([^~]*) captures the title of the block into Group 1, so that we can require the same title after END later.
  2. The lazy .*? matches up to...
  3. END followed by the title (back-referenced by \1) and the closing tildes.

Sample Code

Here is a full program you can test it with. I haven't tried it. You'll need to paste the string in there.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s1 = @"PASTE YOUR STRING HERE";
        var myRegex = new Regex(@"(?s)~~~ START ([^~]*).*?END \1 ~~~");
        MatchCollection AllMatches = myRegex.Matches(s1);

        Console.WriteLine("\n" + "*** Matches ***");
        if (AllMatches.Count > 0)
        {
            // Group 1 is the block title; Value is the whole block, tags included.
            foreach (Match SomeMatch in AllMatches)
            {
                Console.WriteLine("Title: " + SomeMatch.Groups[1].Value);
                Console.WriteLine("Overall Match: " + SomeMatch.Value);
            }
        }

        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program
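The program above prints the block title and the whole match, tags included. If only the body between the tags is wanted, one possible variation is to give the body its own group. The sketch below does that with named groups; the names BlockBodies, Extract, title and body, and the shortened sample string, are illustrative assumptions and not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BlockBodies
{
    // "title" is the text after START; \k<title> requires the same text after END.
    // "body" is everything between the two tags, matched lazily.
    static readonly Regex BlockRegex = new Regex(
        @"(?s)~~~ START (?<title>[^~]*?) ~~~(?<body>.*?)~~~ END \k<title> ~~~");

    // Maps each block title to its trimmed body text.
    public static Dictionary<string, string> Extract(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in BlockRegex.Matches(input))
        {
            result[m.Groups["title"].Value] = m.Groups["body"].Value.Trim();
        }
        return result;
    }

    static void Main()
    {
        // Shortened stand-in for the question's data (an assumption).
        string sample =
            "~~~ START MONTHLY BREAKDOWN ~~~\n<<<ACTIVITY=Total>>>\n~~~ END MONTHLY BREAKDOWN ~~~\n" +
            "~~~ START EVENTS ~~~\n<<<EVENT_NAME=Test's Event>>>\n~~~ END EVENTS ~~~";

        foreach (KeyValuePair<string, string> block in Extract(sample))
        {
            Console.WriteLine(block.Key + " => " + block.Value);
        }
    }
}
```

Because of the back-reference, a block whose END title differs from its START title is skipped rather than matched.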

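You need to call the regex matcher multiple times in a loop, until there is no match. Consider also modifying the expression to avoid backtracking. In your case this is quite feasible: .+ and .* are greedy (as opposed to "reluctant"), so with the SingleLine flag they first run to the end of the input and then backtrack, and the one match you get stretches from the first START tag to the last END tag.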

Here is a small demo of how you can do it:

var regex = new Regex("~~~ START ([^~]+)~~~([^~]*)~~~ END ([^~]+)~~~", RegexOptions.Multiline);
var m = regex.Match(Data);
while (m.Success) {
    Console.WriteLine("------ Start: {0} --------", m.Groups[1]);
    Console.WriteLine(m.Groups[2]);
    Console.WriteLine("------ End: {0} --------", m.Groups[3]);
    m = m.NextMatch();
}

This example is running on ideone.

Note the changes above - I replaced . with [^~] to match up to the first squiggly, and I also captured the content of the start and end tags for printing.
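One caveat with [^~] for the body: it stops at any tilde, so a block whose body contains even a single ~ will not match. The sample data in the question has no such tilde, so this may never matter here. If it does, a possible variant is to forbid only three tildes in a row. The sketch below compares the two; the NOTES block, the NoTripleTilde pattern, and the helper FirstBody are hypothetical and for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

static class TildeInBody
{
    // Pattern from the answer above: the body may not contain any tilde at all.
    public const string NoTildes =
        "~~~ START ([^~]+)~~~([^~]*)~~~ END ([^~]+)~~~";

    // Assumed variant: the body may contain tildes, just not three in a row.
    public const string NoTripleTilde =
        "~~~ START ([^~]+)~~~((?:(?!~~~).)*)~~~ END ([^~]+)~~~";

    // Hypothetical block whose body contains a single tilde.
    public const string Sample =
        "~~~ START NOTES ~~~\n<<<PATH=~/reports>>>\n~~~ END NOTES ~~~";

    // Returns the trimmed body of the first block, or null if nothing matches.
    // Singleline lets "." in the variant match the newlines inside the body.
    public static string FirstBody(string pattern)
    {
        Match m = Regex.Match(Sample, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[2].Value.Trim() : null;
    }

    static void Main()
    {
        Console.WriteLine(FirstBody(NoTildes) ?? "(no match)");
        Console.WriteLine(FirstBody(NoTripleTilde) ?? "(no match)");
    }
}
```

The variant checks a lookahead at every character, so it trades some speed for tolerance of stray tildes.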
