
C#: split a string into runs of characters, numbers and delimited strings and process it

OK my regex is a bit rusty and I've been struggling with this particular problem...

I need to split and process a string containing any number of the following, in any order:

  • Chars (lowercase letters only)
  • Quote delimited strings
  • Ints

The strings are pretty weird (I don't have control over them). When there's more than one number in a row in the string, they're separated by a comma. The tokens need to be processed in the same order that they appeared in the original string.

For example, a string might look like:

abc20a"Hi""OK"100,20b

With this particular string the resulting sequence of calls would look a bit like:

ProcessLetters( new[] { 'a', 'b', 'c' } );
ProcessInts( 20 );
ProcessLetters( 'a' );
ProcessStrings( new[] { "Hi", "OK" } );
ProcessInts( new[] { 100, 20 } );
ProcessLetters( 'b' );

What I could do is treat it a bit like CSV, where you build tokens by processing the characters one at a time, but I think it could be more easily done with a regex?
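For comparison, the character-at-a-time approach mentioned above might look roughly like the sketch below. The names `Scanner` and `Scan` are made up for illustration, and the sketch assumes quoted strings never contain escaped quotes.

```csharp
using System;
using System.Collections.Generic;

static class Scanner
{
    // Hypothetical helper: walks the input once and collects raw tokens in order.
    // Assumption: a quoted string ends at the next '"' (no escaped quotes).
    public static List<string> Scan(string input)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (c >= 'a' && c <= 'z')
            {
                // Each lowercase letter is its own token.
                tokens.Add(c.ToString());
                i++;
            }
            else if (c >= '0' && c <= '9')
            {
                // Consume the whole run of digits as one token.
                int start = i;
                while (i < input.Length && input[i] >= '0' && input[i] <= '9') i++;
                tokens.Add(input.Substring(start, i - start));
            }
            else if (c == '"')
            {
                // Keep the surrounding quotes so the token type stays recognizable.
                int end = input.IndexOf('"', i + 1);
                if (end < 0) throw new FormatException("Unterminated quoted string at index " + i);
                tokens.Add(input.Substring(i, end - i + 1));
                i = end + 1;
            }
            else
            {
                // Commas and any other characters are treated as separators.
                i++;
            }
        }
        return tokens;
    }

    static void Main()
    {
        foreach (string token in Scan("abc20a\"Hi\"\"OK\"100,20b"))
            Console.WriteLine(token);
    }
}
```

This works, but it is noticeably longer than a single pattern, which is why a regex looks attractive here.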

You can make your regexp match each of the three separate options with the or operator |. This should catch valid tokens, skipping commas and other chars.

/[a-z]|[0-9]+|"[^"]*"/

Can your strings contain escaped quotes?

You can use the pattern contained in this string:

@"(""[^""]*""|[a-z]|\d+)"

to tokenize the input string you provided. This pattern captures three things: simple quoted strings (no embedded quotes), single lowercase letters, and runs of one or more digits.
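As a rough sketch of how this simpler pattern might be applied to the string from the question (the names `TokenDemo` and `Tokenize` are hypothetical, chosen only for this example):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenDemo
{
    // Simple pattern: a quoted string without embedded quotes,
    // a single lowercase letter, or a run of digits.
    const string Pattern = @"(""[^""]*""|[a-z]|\d+)";

    // Hypothetical helper: returns the matched tokens in input order.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    static void Main()
    {
        // The verbatim literal below is abc20a"Hi""OK"100,20b
        foreach (string token in Tokenize(@"abc20a""Hi""""OK""100,20b"))
        {
            Console.WriteLine(token);
        }
    }
}
```

The comma between 100 and 20 matches none of the alternatives, so it is simply skipped and the two numbers come out as separate tokens.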

If your quoted strings can have escaped quotes within them (e.g., "Hi\"There\"""OK""Pilgrim"), then you can use this pattern to capture and tokenize them along with the rest of the input string:

@"((?:""[^""\\]*(?:\\.[^""\\]*)*"")|[a-z]|\d+)"

Here's an example:

MatchCollection matches = Regex.Matches(@"abc20a""Hi\""There\""""""OK""""Pilgrim""100,20b", @"((?:""[^""\\]*(?:\\.[^""\\]*)*"")|[a-z]|\d+)");

foreach (Match match in matches)
{
    Console.WriteLine(match.Value);
}

Returns the string tokens:

a
b
c
20
a
"Hi\"There\""
"OK"
"Pilgrim"
100
20
b

One nice thing about this approach is that you can check just the first character of each token to decide which handler it belongs to. If the first character is a letter, the token goes to ProcessLetters; if it is a digit, it goes to ProcessInts. If it is a quote, the token goes to ProcessStrings after trimming the leading and trailing quotes and calling Regex.Unescape() to unescape the embedded quotes.
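A minimal sketch of that dispatch step follows. It assumes consecutive tokens of the same kind should be batched into one call, as in the question's example. The `ProcessLetters`, `ProcessInts` and `ProcessStrings` bodies are stand-ins that only record what they were called with (the real signatures are not shown in the question), and `Dispatcher`, `Run`, `Classify` and `Flush` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class Dispatcher
{
    const string Pattern = @"((?:""[^""\\]*(?:\\.[^""\\]*)*"")|[a-z]|\d+)";

    // Stand-in handlers (assumed signatures): they only log each call.
    public static readonly List<string> Calls = new List<string>();
    static void ProcessLetters(char[] letters) { Calls.Add("Letters: " + new string(letters)); }
    static void ProcessInts(int[] ints) { Calls.Add("Ints: " + string.Join(",", Array.ConvertAll(ints, i => i.ToString()))); }
    static void ProcessStrings(string[] strings) { Calls.Add("Strings: " + string.Join(",", strings)); }

    // Decide the token kind from its first character.
    static char Classify(char first)
    {
        if (first == '"') return 'S';
        return char.IsDigit(first) ? 'I' : 'L';
    }

    // Hand the current run of same-kind tokens to the matching handler.
    static void Flush(char kind, List<string> run)
    {
        if (run.Count == 0) return;
        switch (kind)
        {
            case 'L': ProcessLetters(run.Select(t => t[0]).ToArray()); break;
            // int.Parse throws OverflowException for values beyond int range.
            case 'I': ProcessInts(run.Select(t => int.Parse(t)).ToArray()); break;
            // Trim the surrounding quotes, then unescape \" sequences.
            case 'S': ProcessStrings(run.Select(t => Regex.Unescape(t.Substring(1, t.Length - 2))).ToArray()); break;
        }
        run.Clear();
    }

    public static void Run(string input)
    {
        Calls.Clear();
        var run = new List<string>();
        char kind = '\0';
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            char k = Classify(m.Value[0]);
            if (k != kind) { Flush(kind, run); kind = k; } // kind changed: close the previous run
            run.Add(m.Value);
        }
        Flush(kind, run); // close the final run
    }

    static void Main()
    {
        Run(@"abc20a""Hi""""OK""100,20b");
        foreach (string call in Calls) Console.WriteLine(call);
    }
}
```

For the string from the question this records six calls in the original order: letters abc, int 20, letter a, strings Hi and OK, ints 100 and 20, letter b. Note that Regex.Unescape() throws on escape sequences it does not recognize, so a different unescaping routine may be needed if the strings can contain arbitrary backslashes.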

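using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string test = @"abc20a""Hi""""OK""100,20b";
        // Because the pattern is wrapped in a capturing group, Regex.Split
        // returns the matched tokens along with the text between them.
        string[] results = Regex.Split(test, @"(""[a-zA-Z]+""|\d+|[a-zA-Z]+)");

        foreach (string result in results)
        {
            // Skip the empty strings and commas left between tokens.
            if (!String.IsNullOrEmpty(result) && result != ",")
            {
                Console.WriteLine("result: " + result);
            }
        }
        Console.ReadLine();
    }
}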
