
Using Regular Expressions to extract groups of numbers from a string

I need to convert a string like,

"[1,2,3,4][5,6,7,8]"

into groups of integers, adjusted to be zero based rather than one based:

{0,1,2,3} {4,5,6,7}

The following rules also apply:

  • The string must contain at least 1 group of numbers with enclosing square brackets.
  • Each group must contain at least 2 numbers.
  • Every number must be unique (not something I'm attempting to achieve with the regex).
  • 0 is not valid, but 10, 100 etc are.

Since I'm not that experienced with regular expressions, I'm currently using two:

@"^(?:\[(?:[1-9]+[\d]*,)+(?:[1-9]+[\d]*){1}\])+$";

and

@"\[(?:[1-9]+[\d]*,)+(?:[1-9]+[\d]*){1}\]";

I'm using the first one to check the input and the second to get all matches of a set of numbers inside square brackets.

I'm then using .Net string manipulation to trim off the square brackets and extract the numbers, parsing them and subtracting 1 to get the result I need.

I was wondering if I could get at the numbers better by using captures, but not sure how they work.


Final Solution:

In the end I used the following regular expression to validate the input string

@"^(?<set>\[(?:[1-9]\d{0,7}(?:]|,(?=\d))){2,})+$"

agent-j's pattern is fine for capturing the information needed but also matches a string like "[1,2,3,4][5]" and would require me to do some additional filtering of the results.

I access the captures via the named group 'set' and use a second simple regex to extract the numbers.

The '[1-9]\d{0,7}' part simplifies parsing ints by limiting each number to at most 99,999,999, which avoids overflow exceptions.

// The pattern is anchored with ^ and $, so a valid input yields exactly one match.
MatchCollection matches = new Regex(@"^(?<set>\[(?:[1-9]\d{0,7}(?:]|,(?=\d))){2,})+$").Matches(inputText);

if (matches.Count != 1) return;

// Each capture of the named group 'set' is one bracketed group, e.g. "[1,2,3,4]".
CaptureCollection captures = matches[0].Groups["set"].Captures;

var resultJArray = new int[captures.Count][];
var numbersRegex = new Regex(@"\d+");
for (int captureIndex = 0; captureIndex < captures.Count; captureIndex++)
{
    string capture = captures[captureIndex].Value;
    // Pull the individual numbers out of this bracketed group.
    MatchCollection numberMatches = numbersRegex.Matches(capture);
    resultJArray[captureIndex] = new int[numberMatches.Count];
    for (int numberMatchIndex = 0; numberMatchIndex < numberMatches.Count; numberMatchIndex++)
    {
        string number = numberMatches[numberMatchIndex].Value;
        // Convert from one-based to zero-based.
        int numberAdjustedToZeroBase = Int32.Parse(number) - 1;
        resultJArray[captureIndex][numberMatchIndex] = numberAdjustedToZeroBase;
    }
}
string input = "[1,2,3,4][5,6,7,8][534,63433,73434,8343434]";
string pattern = @"\G(?:\[(?:(\d+)(?:,|(?=\]))){2,}\])";
MatchCollection matches = Regex.Matches(input, pattern);

To start out, any (regex) in plain parentheses is a capturing group. This means that the regex engine will capture (store the text and positions matched by) that group. To avoid this when you don't need it, use (?:regex). I did that above.

Index 0 is special: it refers to the whole match. That is, match.Groups[0].Value is always the same as match.Value and match.Groups[0].Captures[0].Value, so the groups you define yourself start at Groups[1]. Each group's Captures collection, however, is zero-based: Captures[0] is the first value that group captured.

As you can see below, each match contains one bracketed group of digits. You'll want to use captures 0 to n-1 from Group 1 of each match.

foreach (Match match in matches)
{
   // match.Value is one bracketed group, e.g. [1,2]
   // Group 1 holds one capture per number; its Captures collection is zero-based.
   for (int i = 0; i < match.Groups[1].Captures.Count; i++)
   {
      int number = int.Parse(match.Groups[1].Captures[i].Value);
      if (number == 0)
         throw new Exception("Cannot be 0.");
   }
}

Match[0] => [1,2,3,4]
  Group[0] => [1,2,3,4]
    Capture[0] => [1,2,3,4]
  Group[1] => 4
    Capture[0] => 1
    Capture[1] => 2
    Capture[2] => 3
    Capture[3] => 4
Match[1] => [5,6,7,8]
  Group[0] => [5,6,7,8]
    Capture[0] => [5,6,7,8]
  Group[1] => 8
    Capture[0] => 5
    Capture[1] => 6
    Capture[2] => 7
    Capture[3] => 8
Match[2] => [534,63433,73434,8343434]
  Group[0] => [534,63433,73434,8343434]
    Capture[0] => [534,63433,73434,8343434]
  Group[1] => 8343434
    Capture[0] => 534
    Capture[1] => 63433
    Capture[2] => 73434
    Capture[3] => 8343434
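
The dump above shows that the individual numbers live in the Captures collection of Group 1, indexed from 0. As a minimal sketch of how that could be turned into the zero-based arrays the question asks for (the class name CaptureDemo and the helper ExtractGroups are hypothetical, not part of the original answer, and the zero and uniqueness checks are deliberately left out):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper: returns one int[] per bracketed group,
    // with every number shifted from one-based to zero-based.
    public static List<int[]> ExtractGroups(string input)
    {
        // Same idea as the pattern above: one match per bracketed group,
        // one capture of group 1 per number.
        const string pattern = @"\G\[(?:(\d+)(?:,|(?=\]))){2,}\]";
        var result = new List<int[]>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            CaptureCollection captures = match.Groups[1].Captures;
            var numbers = new int[captures.Count];
            for (int i = 0; i < captures.Count; i++)
            {
                // int.Parse throws OverflowException for very long digit runs;
                // this sketch does not guard against that.
                numbers[i] = int.Parse(captures[i].Value) - 1;
            }
            result.Add(numbers);
        }
        return result;
    }

    public static void Main()
    {
        foreach (int[] group in ExtractGroups("[1,2,3,4][5,6,7,8]"))
        {
            Console.WriteLine("{" + string.Join(",", group) + "}");
        }
        // prints {0,1,2,3} then {4,5,6,7}
    }
}
```

Because this pattern does not anchor the end of the string, it extracts whatever well-formed groups appear at the start; a separate validation pass is still needed if the whole input must be well formed.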

The \G anchor requires each match to begin exactly where the previous match ended, so in a string like [1,2] [3,4] matching stops after [1,2] rather than skipping over the space. The {2,} satisfies your requirement that there be at least 2 numbers per match.
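
To make the effect of \G concrete, here is a small sketch that counts matches with and without the anchor on an input containing a gap. The class name ContiguousMatchDemo and the helper CountMatches are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContiguousMatchDemo
{
    // Hypothetical helper: number of matches of pattern in input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // The bracketed-group pattern from above, without the leading \G.
        const string body = @"\[(?:(\d+)(?:,|(?=\]))){2,}\]";
        const string gapped = "[1,2] [3,4]";

        // With \G the second group is unreachable, because the next match
        // would have to start at the space right after "[1,2]".
        Console.WriteLine(CountMatches(gapped, @"\G" + body)); // 1

        // Without \G the engine skips the space and finds both groups.
        Console.WriteLine(CountMatches(gapped, body)); // 2
    }
}
```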

The expression will match even if there is a 0. I suggest that you put that validation in with the other non-regex stuff. It will keep the regex simpler.
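
One possible shape for that non-regex validation is sketched below. It checks the parsed one-based numbers for zeros and, since the question also requires every number to be unique, for duplicates across all groups. The class name NumberRules and the method AreValid are hypothetical, and the sketch assumes the numbers have already been parsed into int arrays.

```csharp
using System;
using System.Collections.Generic;

public static class NumberRules
{
    // Hypothetical helper: true when every one-based number is at least 1
    // and no number occurs twice anywhere in the input.
    public static bool AreValid(IEnumerable<int[]> groups)
    {
        var seen = new HashSet<int>();
        foreach (int[] group in groups)
        {
            foreach (int number in group)
            {
                if (number < 1) return false;        // 0 is not a valid input number
                if (!seen.Add(number)) return false; // Add returns false for a duplicate
            }
        }
        return true;
    }

    public static void Main()
    {
        var ok = new List<int[]> { new[] { 1, 2 }, new[] { 3, 4 } };
        var withZero = new List<int[]> { new[] { 0, 2 } };
        var withDuplicate = new List<int[]> { new[] { 1, 2 }, new[] { 2, 3 } };

        Console.WriteLine(AreValid(ok));            // True
        Console.WriteLine(AreValid(withZero));      // False
        Console.WriteLine(AreValid(withDuplicate)); // False
    }
}
```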

The following regex will validate the input and also produce match groups for each bracketed [] group and, inside that, for each number:

(?:([1-9][0-9]*)\,?){2,}



[1][5]  -  fail
[1]  -  fail
[]  -  fail
[a,b,c][5]  -  fail
[1,2,3,4]  -  pass
[1,2,3,4,5,6,7,8][5,6,7,8]  -  pass
[1,2,3,4][5,6,7,8][534,63433,73434,8343434]  -  pass
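
The pass/fail list above can be turned into a quick check. The sketch below is an assumption-laden illustration: it runs those inputs against the validation pattern from the Final Solution section of the question, not against the pattern in this answer, and the class name InputChecks and method IsValid are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputChecks
{
    // Validation pattern copied from the Final Solution section of the question.
    private static readonly Regex Validator =
        new Regex(@"^(?<set>\[(?:[1-9]\d{0,7}(?:]|,(?=\d))){2,})+$");

    // Hypothetical helper: true when the whole input passes the pattern.
    public static bool IsValid(string input)
    {
        return Validator.IsMatch(input);
    }

    public static void Main()
    {
        string[] inputs =
        {
            "[1][5]", "[1]", "[]", "[a,b,c][5]",
            "[1,2,3,4]",
            "[1,2,3,4,5,6,7,8][5,6,7,8]",
            "[1,2,3,4][5,6,7,8][534,63433,73434,8343434]"
        };
        foreach (string input in inputs)
        {
            Console.WriteLine(input + "  -  " + (IsValid(input) ? "pass" : "fail"));
        }
    }
}
```

Uniqueness is not covered by this check; as noted in the question, that rule is handled outside the regex.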

What about \d+ and the global flag?
