
Regex match but ignore specific characters in the output

I need a regex PATTERN (to be used in C#) that will match integer values WITH 3-digit comma separators but WON'T return the commas in the resulting match value. For example, I need the following code to write 1, 1234, and 1234567 to the console:

string text = "This 1 is 1,234 a 1,234,567 sentence 7,654.321.";
// NOTE: value "7,654.321" would preferably NOT match, 
//       but it is acceptable for now if it does
MatchCollection matches = Regex.Matches(text, PATTERN);
foreach (Match match in matches)
    Console.Write(match.Value + " ");

I CANNOT call Regex.Matches and then do a String.Replace to remove the commas; it all must happen in the regex PATTERN (because all my regex expressions are being pulled from a database and cannot include logic outside the pattern itself without lots of spaghetti code). As noted, I would prefer not to match rational values, but that should be easy to fix once I get the comma exclusion working.

The following pattern DOES NOT WORK, but it is probably pretty close to what I need:

// THIS PATTERN DOES NOT WORK!!!
//    but is probably close to what I need
string PATTERN = @"([\+-]?[0-9]+[(?<=,)[0,9]{3}]*)([eE][\+]?[0-9]+)?";

If you remove the [(?<=,)[0,9]{3}]* from above, you have a standard integer pattern. Once again, I need to accept commas in the integer, but not return them as part of the match. How should I change this pattern?

A regex match is a whole substring of the input string. It can't be a set of substrings - it has to be one substring.

Similarly, the capturing groups can only capture substrings so you can't do much about this either.
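As a small illustration of that limitation (a sketch only; the class and helper names are made up for this example), the whole match keeps the commas, and a repeated capturing group reports just its last captured substring through `Value`:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeMatchDemo
{
    // Hypothetical helper: the first comma-grouped integer, exactly as matched.
    public static string FirstMatch(string input)
    {
        return Regex.Match(input, @"[0-9]+(?:,[0-9]{3})*").Value;
    }

    // Hypothetical helper: Value of the repeated group 2, i.e. its last capture only.
    public static string LastGroupValue(string input)
    {
        return Regex.Match(input, @"([0-9]+)(?:,([0-9]{3}))*").Groups[2].Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("a 1,234,567 b"));     // prints 1,234,567
        Console.WriteLine(LastGroupValue("a 1,234,567 b")); // prints 567
    }
}
```

Neither value is the comma-free number, which is why the pattern alone cannot produce it.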

But since you're using .NET you could try a really ugly hack by leveraging the capture stack, if you can afford to add some general-purpose code.

First, the regex. It is simplified to the minimum just so it's easier to understand:

(?:(?<concat>\d+),?)+

A full version of the regex is provided below, but for now we'll stick with that one.
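To see what this simplified pattern records, the sketch below (class and helper names are hypothetical) lists every capture stored for the `concat` group in the first match. .NET keeps all captures of a repeated group, not just the last one:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureStackDemo
{
    // Hypothetical helper: every capture recorded for "concat" in the first match.
    public static List<string> ConcatCaptures(string input)
    {
        Match match = Regex.Match(input, @"(?:(?<concat>\d+),?)+");
        var values = new List<string>();
        foreach (Capture capture in match.Groups["concat"].Captures)
        {
            values.Add(capture.Value);
        }
        return values;
    }

    public static void Main()
    {
        // One match, three captures: the commas sit between the captures.
        Console.WriteLine(string.Join("|", ConcatCaptures("1,234,567"))); // prints 1|234|567
    }
}
```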

Then, in your code you could implement the following logic:

  • If the regex doesn't contain a group named concat, then process as usual
  • If it does, do the following instead of getting the whole match:
    • Extract all captures of that group: match.Groups["concat"].Captures
    • Concat their captured values
    • And then use that value

This would be similar to this:

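using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static IEnumerable<string> GetValues(string input)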
{
    // Suppose regex could be any regex
    var regex = new Regex(@"(?:(?<concat>\d+),?)+");

    foreach (Match match in regex.Matches(input))
    {
        // Does this regex have our special feature?
        if (regex.GroupNumberFromName("concat") >= 0)
        {
            // Concat the captured values
            var captures = match.Groups["concat"].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
            yield return String.Concat(captures);
        }
        else
        {
            // This is a normal regex
            yield return match.Value;   
        }
    }
}


Ok, this is a hack, but it would let you keep the logic in a declarative and reusable way in the regex.

Now the full regex you posted would look something like this in its hacked version:

(?<concat>[-+])?(?<concat>[0-9]+)(?:,(?<concat>[0-9]{3}))*(?<concat>[eE][-+]?[0-9]+)?

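Putting the pieces together, here is a minimal sketch that runs the full pattern against the sample sentence from the question. The `Extract` helper and the class name are hypothetical. `StrictPattern` is an assumption added here, not part of the answer above: it wraps the full pattern in lookarounds so that a number touching a decimal part, such as 7,654.321, is skipped entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ConcatGroupDemo
{
    // The full "hacked" pattern from the answer.
    public const string FullPattern =
        @"(?<concat>[-+])?(?<concat>[0-9]+)(?:,(?<concat>[0-9]{3}))*(?<concat>[eE][-+]?[0-9]+)?";

    // Assumption (not from the answer): reject numbers that are preceded or
    // followed by more digits, or that continue with ".digit" / ",digit".
    public const string StrictPattern =
        @"(?<![0-9.,])" + FullPattern + @"(?![0-9]|[.,][0-9])";

    // Hypothetical helper: concatenates "concat" captures when the group exists,
    // otherwise falls back to the ordinary match value.
    public static List<string> Extract(string pattern, string input)
    {
        var regex = new Regex(pattern);
        bool hasConcat = regex.GroupNumberFromName("concat") >= 0;
        var results = new List<string>();
        foreach (Match match in regex.Matches(input))
        {
            results.Add(hasConcat
                ? string.Concat(match.Groups["concat"].Captures.Cast<Capture>().Select(c => c.Value))
                : match.Value);
        }
        return results;
    }

    public static void Main()
    {
        string text = "This 1 is 1,234 a 1,234,567 sentence 7,654.321.";
        Console.WriteLine(string.Join(" ", Extract(FullPattern, text)));   // prints 1 1234 1234567 7654 321
        Console.WriteLine(string.Join(" ", Extract(StrictPattern, text))); // prints 1 1234 1234567
    }
}
```

Because the special behavior is keyed on the group name, patterns without a `concat` group keep working unchanged through the same helper.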
