How to split a string not losing separator characters using BCL in C#?

I need to split a string on any character from an array of separators without losing those separators in the result. For example:

string: "Hello world!"
separators: " !"
result: ("Hello", " ", "world", "!")

Of course, I can write something that walks through the string and returns the result I need, but isn't there something that already does this, such as a suitably configured String.Split?

Update: I need a solution without regular expressions, because they are too slow for my use case.

Use a regular expression:

string[] parts = Regex.Split(myString, yourPattern);

Test:

string[] parts = Regex.Split("Hello World!", "(!| )");

Output:

Hello
" "   // just a space
World
!
""    // empty string

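The separators survive because Regex.Split includes the text matched by capturing groups in the returned array, which is what the parentheses in "(!| )" rely on. The following is a minimal sketch, assuming a hypothetical helper named SplitKeepingSeparators (not part of the BCL), that builds such a pattern from a char array and drops empty entries such as the trailing one in the output above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitExample
{
    // Hypothetical helper for illustration; the name is not a BCL API.
    public static string[] SplitKeepingSeparators(string input, char[] separators)
    {
        // Build an alternation such as "(\ |!)". Each separator is escaped
        // individually so that regex metacharacters are treated literally.
        string pattern = "(" + string.Join("|",
            separators.Select(c => Regex.Escape(c.ToString()))) + ")";

        // The capturing group makes Regex.Split return the separators too.
        // Empty entries (adjacent separators, or a separator at either end
        // of the input) are filtered out.
        return Regex.Split(input, pattern)
                    .Where(part => part.Length > 0)
                    .ToArray();
    }
}
```

Calling RegexSplitExample.SplitKeepingSeparators("Hello world!", new[] { ' ', '!' }) should give "Hello", " ", "world", "!". This does not address the speed concern raised in the question.
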
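A LINQ solution:

// Requires: using System.Collections.Generic; using System.Linq;
var s = "Hello world!";
char[] separators = { ' ', '!' };

string current = string.Empty;
List<string> result = s.Aggregate(new List<string>(), (list, ch) =>
    {
        if (separators.Contains(ch))
        {
            // flush the token collected so far, then the separator itself
            // (the token is empty when two separators are adjacent)
            list.Add(current);
            list.Add(ch.ToString());
            current = string.Empty;
        }
        else current += ch;
        return list;
    }, list => list);

// flush the trailing token when the string does not end with a separator
if (current.Length > 0) result.Add(current);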

This would be a purely procedural solution:

private static IEnumerable<string> Tokenize(string text, string separators)
{
    int startIdx = 0;
    int currentIdx = 0;

    while (currentIdx < text.Length)
    {
        // found a separator? (IndexOf avoids a dependency on System.Linq)
        if (separators.IndexOf(text[currentIdx]) >= 0)
        {
            // yield a substring, if it's not empty
            if (currentIdx > startIdx)
                yield return text.Substring(startIdx, currentIdx - startIdx);

            // yield the separator
            yield return text.Substring(currentIdx, 1);

            // mark the beginning of the next token
            startIdx = currentIdx + 1;
        }

        currentIdx++;
    }

    // yield the trailing token after the last separator, if there is one
    if (startIdx < text.Length)
        yield return text.Substring(startIdx);
}

Note that this solution avoids returning empty tokens. For example, if the input is:

string input = "test!!";

calling Tokenize(input, "!") will return three tokens:

test
!
!

If the requirement is that two adjacent separators should have an empty token between them, then the if (currentIdx > startIdx) condition should be removed.
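
The empty-token choice described above can also be exposed as a parameter. The following is a minimal sketch built on String.IndexOfAny from the BCL; the class name SeparatorSplitter, the method name, and the keepEmpty flag are assumptions made for illustration, and no performance claim is made for it:

```csharp
using System;
using System.Collections.Generic;

public static class SeparatorSplitter
{
    // Hypothetical helper: splits text on any of the separators and
    // returns the separators as tokens of their own.
    public static IEnumerable<string> SplitKeepingSeparators(
        string text, char[] separators, bool keepEmpty = false)
    {
        int start = 0;
        int hit;

        // IndexOfAny accepts start == text.Length and then returns -1.
        while ((hit = text.IndexOfAny(separators, start)) >= 0)
        {
            // token before the separator (possibly empty)
            if (keepEmpty || hit > start)
                yield return text.Substring(start, hit - start);

            // the separator itself
            yield return text.Substring(hit, 1);

            start = hit + 1;
        }

        // trailing token after the last separator (possibly empty)
        if (keepEmpty || start < text.Length)
            yield return text.Substring(start);
    }
}
```

With keepEmpty left at false, SplitKeepingSeparators("test!!", new[] { '!' }) yields test, !, !. With keepEmpty: true it also yields an empty string between the two separators and another one after the last separator.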
