
Is there a lazy `String.Split` in C#?

All string.Split methods seem to return an array of strings ( string[] ).

I'm wondering if there is a lazy variant that returns an IEnumerable<string>, so that for large strings (or an IEnumerable<char> of infinite length), when one is only interested in the first few subsequences, one saves computational effort as well as memory. It could also be useful if the string is produced by a device or program (network, terminal, pipes) and the entire string is therefore not necessarily available right away, so that one can already start processing the first occurrences.

Is there such a method in the .NET Framework?

There is no such thing built-in. Regex.Matches is lazy if I interpret the decompiled code correctly. Maybe you can make use of that.
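A minimal sketch of the Regex.Matches idea, assuming the separator is a comma. The names LazyTokens and SplitOnCommas are made up for illustration. Note that the pattern "[^,]+" matches the runs between separators, so empty entries are skipped, unlike string.Split with default options.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyTokens
{
    // Hypothetical helper: yields each run of non-comma characters.
    // The MatchCollection is enumerated match by match, so a caller
    // that stops early never scans the rest of the input.
    public static IEnumerable<string> SplitOnCommas(string input)
    {
        foreach (Match m in Regex.Matches(input, "[^,]+"))
        {
            yield return m.Value;
        }
    }
}
```

Avoid touching Count on the MatchCollection if laziness matters, since that forces every match to be found.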

Or, you simply write your own split function.

Actually, you could imagine most string functions generalized to arbitrary sequences. Often, even sequences of T , not just char . The BCL does not emphasize that generalization at all. There is no Enumerable.Subsequence for example.
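To illustrate that generalization, here is a rough sketch of a lazy split over any IEnumerable<T>. The names SequenceSplit and SplitOn are hypothetical, not part of the BCL. Each chunk is buffered in a List<T>, but chunks are produced one at a time, so it also works on an unbounded source as long as the caller stops enumerating.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceSplit
{
    // Hypothetical helper: splits any sequence at the elements for which
    // isSeparator returns true. Separators are dropped; empty chunks are kept.
    public static IEnumerable<List<T>> SplitOn<T>(
        this IEnumerable<T> source, Func<T, bool> isSeparator)
    {
        var current = new List<T>();
        foreach (var item in source)
        {
            if (isSeparator(item))
            {
                yield return current;
                current = new List<T>();
            }
            else
            {
                current.Add(item);
            }
        }

        // Whatever follows the last separator (possibly nothing) is the final chunk.
        yield return current;
    }
}
```

Since string implements IEnumerable<char>, "a,b,,c".SplitOn(ch => ch == ',') works directly. On a very long sequence such as Enumerable.Range(1, int.MaxValue).SplitOn(n => n % 3 == 0).Take(2), only the first six numbers are ever read.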

You could easily write one:

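using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    // Note: string already has an instance Split(params char[]), and instance methods
    // win over extension methods, so call this as StringExtensions.Split(text, ',').
    public static IEnumerable<string> Split(this string toSplit, params char[] splits)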
    {
        if (string.IsNullOrEmpty(toSplit))
            yield break;

        StringBuilder sb = new StringBuilder();

        foreach (var c in toSplit)
        {
            if (splits.Contains(c))
            {
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
    }
}

Clearly, I haven't tested it for parity with string.Split (for one, it drops a trailing empty entry), but I believe it should otherwise work just about the same.

As Servy notes, this doesn't split on strings. That's not as simple, and not as efficient, but it's basically the same pattern.

public static IEnumerable<string> Split(this string toSplit, string[] separators)
{
    if (string.IsNullOrEmpty(toSplit))
        yield break;

    StringBuilder sb = new StringBuilder();
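    foreach (var c in toSplit)
    {
        // Append first, then check whether the buffer now ends with a separator
        // (StringComparison lives in the System namespace).
        sb.Append(c);
        var s = sb.ToString();
        var sep = separators.FirstOrDefault(
            i => !string.IsNullOrEmpty(i) && s.EndsWith(i, StringComparison.Ordinal));
        if (sep != null)
        {
            // Yield the buffer without the trailing separator.
            yield return s.Substring(0, s.Length - sep.Length);
            sb.Clear();
        }
    }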

    if (sb.Length > 0)
        yield return sb.ToString();
}

Nothing built-in, but feel free to rip my Tokenize method:

/// <summary>
/// Splits a string into tokens.
/// </summary>
/// <param name="s">The string to split.</param>
/// <param name="isSeparator">
/// A function testing if a code point at a position
/// in the input string is a separator.
/// </param>
/// <returns>A sequence of tokens.</returns>
IEnumerable<string> Tokenize(string s, Func<string, int, bool> isSeparator = null)
{
    if (isSeparator == null) isSeparator = (str, i) => !char.IsLetterOrDigit(str, i);

    int startPos = -1;

    for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        if (!isSeparator(s, i))
        {
            if (startPos == -1) startPos = i;
        }
        else if (startPos != -1)
        {
            yield return s.Substring(startPos, i - startPos);
            startPos = -1;
        }
    }

    if (startPos != -1)
    {
        yield return s.Substring(startPos);
    }
}

There is no built-in method to do this as far as I know. But that doesn't mean you can't write one. Here is a sample to give you an idea:

public static IEnumerable<string> SplitLazy(this string str, params char[] separators)
{
    List<char> temp = new List<char>();
    foreach (var c in str)
    {
        if (separators.Contains(c))
        {
            // Only emit non-empty chunks; a separator itself is never buffered.
            if (temp.Any())
            {
                yield return new string(temp.ToArray());
                temp.Clear();
            }
        }
        else
        {
            temp.Add(c);
        }
    }
    if(temp.Any()) { yield return new string(temp.ToArray()); }
}

Of course, this doesn't handle all cases and can be improved.

I wrote this variant, which also supports StringSplitOptions and count. It behaved the same as string.Split in all the test cases I tried. The nameof operator is specific to C# 6 and later and can be replaced by "count".

public static class StringExtensions
{
    /// <summary>
    /// Splits a string into substrings that are based on the characters in an array. 
    /// </summary>
    /// <param name="value">The string to split.</param>
    /// <param name="options"><see cref="StringSplitOptions.RemoveEmptyEntries"/> to omit empty array elements from the array returned; or <see cref="StringSplitOptions.None"/> to include empty array elements in the array returned.</param>
    /// <param name="count">The maximum number of substrings to return.</param>
    /// <param name="separator">A character array that delimits the substrings in this string, an empty array that contains no delimiters, or null. </param>
    /// <returns></returns>
    /// <remarks>
    /// Delimiter characters are not included in the elements of the returned array. 
    /// If this instance does not contain any of the characters in separator the returned sequence consists of a single element that contains this instance.
    /// If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the <see cref="Char.IsWhiteSpace"/> method.
    /// </remarks>
    public static IEnumerable<string> SplitLazy(this string value, int count = int.MaxValue, StringSplitOptions options = StringSplitOptions.None, params char[] separator)
    {
        if (count <= 0)
        {
            if (count < 0) throw new ArgumentOutOfRangeException(nameof(count), "Count cannot be less than zero.");
            yield break;
        }

        Func<char, bool> predicate = char.IsWhiteSpace;
        if (separator != null && separator.Length != 0)
            predicate = (c) => separator.Contains(c);

        if (string.IsNullOrEmpty(value) || count == 1 || !value.Any(predicate))
        {
            yield return value;
            yield break;
        }

        bool removeEmptyEntries = (options & StringSplitOptions.RemoveEmptyEntries) != 0;
        int ct = 0;
        var sb = new StringBuilder();
        for (int i = 0; i < value.Length; ++i)
        {
            char c = value[i];
            if (!predicate(c))
            {
                sb.Append(c);
            }
            else
            {
                if (sb.Length != 0)
                {
                    yield return sb.ToString();
                    sb.Clear();
                }
                else
                {
                    if (removeEmptyEntries)
                        continue;
                    yield return string.Empty;
                }

                if (++ct >= count - 1)
                {
                    if (removeEmptyEntries)
                        while (++i < value.Length && predicate(value[i]));
                    else
                        ++i;
                    // The remainder (even a single character) is the last element.
                    if (i < value.Length)
                        yield return value.Substring(i);
                    else if (!removeEmptyEntries)
                        yield return string.Empty;
                    yield break;
                }
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
        else if (!removeEmptyEntries && predicate(value[value.Length - 1]))
            yield return string.Empty;
    }

    public static IEnumerable<string> SplitLazy(this string value, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, StringSplitOptions.None, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, StringSplitOptions options, params char[] separator)
    {
        return value.SplitLazy(int.MaxValue, options, separator);
    }

    public static IEnumerable<string> SplitLazy(this string value, int count, params char[] separator)
    {
        return value.SplitLazy(count, StringSplitOptions.None, separator);
    }
}

I wanted the functionality of Regex.Split , but in a lazily evaluated form. The code below just runs through all Matches in the input string, and produces the same results as Regex.Split :

public static IEnumerable<string> Split(string input, string pattern, RegexOptions options = RegexOptions.None)
{
    // Always compile - we expect many executions
    var regex = new Regex(pattern, options | RegexOptions.Compiled);

    int currentSplitStart = 0;
    var match = regex.Match(input);

    while (match.Success)
    {
        yield return input.Substring(currentSplitStart, match.Index - currentSplitStart);

        currentSplitStart = match.Index + match.Length;
        match = match.NextMatch();
    }

    yield return input.Substring(currentSplitStart);
}

Note that using this with the pattern parameter @"\s" will give you the same results as string.Split() .

A lazy split that does not create temporary strings.

Each chunk is copied out of the source string with a single call to mscorlib's String.Substring.

public static IEnumerable<string> LazySplit(this string source, StringSplitOptions stringSplitOptions, params string[] separators)
{
    var sourceLen = source.Length;

    bool IsSeparator(int index, string separator)
    {
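        var separatorLen = separator.Length;

        // An empty separator never matches; treating it as a match would stop i from advancing.
        if (separatorLen == 0 || sourceLen < index + separatorLen)
        {
            return false;
        }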

        for (var i = 0; i < separatorLen; i++)
        {
            if (source[index + i] != separator[i])
            {
                return false;
            }
        }

        return true;
    }

    var indexOfStartChunk = 0;

    for (var i = 0; i < source.Length; i++)
    {
        foreach (var separator in separators)
        {
            if (IsSeparator(i, separator))
            {
                // Skip empty chunks only when RemoveEmptyEntries is requested.
                if (i > indexOfStartChunk || stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
                {
                    yield return source.Substring(indexOfStartChunk, i - indexOfStartChunk);
                }

                i += separator.Length;
                indexOfStartChunk = i--;
                break;
            }
        }
    }

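    // Emit the tail, including the whole source when no separator was found.
    if (indexOfStartChunk < sourceLen || stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
    {
        yield return source.Substring(indexOfStartChunk, sourceLen - indexOfStartChunk);
    }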
}
