
Find the most frequently occurring character in a C# string, returning a string of the longest set of recurring characters

I have an extension to a question already asked here

However, I want to return the longest set of recurring characters in the original string, not a list of chars and their relative counts ordered from highest to lowest.

I am fairly well versed in LINQ, but had never come across an instance of querying the char values in a string, and thought someone could give me a hint to help me understand this specific use case of LINQ...

Thanks

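Using the linked example (note that GroupBy ignores position, so this returns the most frequent character repeated by its total count, which is not necessarily a consecutive run):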

var largest = input.GroupBy(x => x).OrderByDescending(x => x.Count()).First();
var asString = new string(largest.Key, largest.Count());
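
To make the behaviour of the GroupBy approach concrete, here is a minimal sketch that wraps the two lines above in a method. The class name FrequencyVersusRun and the method name MostFrequent are hypothetical and only there for illustration.

```csharp
using System;
using System.Linq;

public static class FrequencyVersusRun
{
    // Returns the most frequent character, repeated by its total count.
    // Position in the string is ignored, so the result need not be a consecutive run.
    // Throws InvalidOperationException on empty input, because First() finds no group.
    public static string MostFrequent(string input)
    {
        var largest = input.GroupBy(c => c).OrderByDescending(g => g.Count()).First();
        return new string(largest.Key, largest.Count());
    }
}
```

For the input "aabbbaa", MostFrequent returns "aaaa" because 'a' occurs four times in total, even though the longest consecutive run in that string is "bbb".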

I'm assuming that you want the longest substring of identical characters. For example, for aab💈💈💈ccc💈💈 you want 💈💈💈

I also assume the problem domain is strings of Unicode characters. Unfortunately, .NET's System.String is a sequence of UTF-16 code units, not of Unicode characters. To count or index Unicode characters, you have to deal with them as codepoints. The easiest way to do that is to change the encoding to UTF-32, since there is then one int per codepoint, and a codepoint is (generally speaking) the numeric identifier of a Unicode character.

After that, to find the longest subsequence of identical characters, you have to run through the whole sequence. Run-length encoding is a generalized method that I'm using as an intermediate step. After finding the codepoint and length for the longest subsequence, I recreate a string of them.

        const string test = "aab💈💈💈ccc💈💈"; // contains barber pole characters
        Console.WriteLine(test);

        var longest = test.ToCodepoints().RunLengthEncode().OrderByDescending(itemCount => itemCount.Item2).First();
        var subsequence = String.Concat(Enumerable.Repeat(Char.ConvertFromUtf32(longest.Item1), longest.Item2));
        Console.WriteLine(subsequence);

Converting a string to codepoints is equivalent to converting to UTF-32. It can be done with a System.Text.Encoding method, but then you end up with an array of bytes that must in turn be converted to codepoints. Here is an extension method that yields the codepoints as an IEnumerable<int> instead:

    public static IEnumerable<int> ToCodepoints(this String s)
    {
        var codeunits = s.ToCharArray();
        var i = 0;

        while (i < codeunits.Length)
        {
            int codepoint;
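            // Only combine a high surrogate with the low surrogate that follows it;
            // an unpaired surrogate falls through and is yielded as its own value.
            if (Char.IsHighSurrogate(codeunits[i]) && i + 1 < codeunits.Length && Char.IsLowSurrogate(codeunits[i + 1]))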
            {
                codepoint = Char.ConvertToUtf32(codeunits[i], codeunits[i + 1]);
                i += 2;
            }
            else
            {
                codepoint = codeunits[i];
                i += 1;
            }
            yield return codepoint;
        }

    }
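
For comparison, the System.Text.Encoding route mentioned above can be sketched roughly as follows. The class name Utf32Conversion and the method name ToCodepointArray are hypothetical. The sketch relies on Encoding.UTF32 being the little-endian UTF-32 encoding; how malformed input such as unpaired surrogates is handled depends on the encoder's fallback settings and is not addressed here.

```csharp
using System;
using System.Text;

public static class Utf32Conversion
{
    // Encodes the string as UTF-32 and reads every four bytes as one codepoint.
    public static int[] ToCodepointArray(string s)
    {
        // Encoding.UTF32 uses little-endian byte order; GetBytes does not emit a byte order mark.
        byte[] bytes = Encoding.UTF32.GetBytes(s);
        var codepoints = new int[bytes.Length / 4];

        for (int i = 0; i < codepoints.Length; i++)
        {
            int offset = i * 4;
            // Assemble the little-endian bytes explicitly so the result
            // does not depend on the byte order of the host machine.
            codepoints[i] = bytes[offset]
                | (bytes[offset + 1] << 8)
                | (bytes[offset + 2] << 16)
                | (bytes[offset + 3] << 24);
        }

        return codepoints;
    }
}
```

This allocates a byte array and an int array for the whole string up front, whereas the iterator above yields codepoints lazily, which is one reason to prefer the iterator when the sequence is consumed only once.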

Run-length encoding produces a Tuple of the codepoint (Item1) and the length of the run (Item2) for each subsequence of identical codepoints:

    public static IEnumerable<Tuple<T, int>> RunLengthEncode<T>(this IEnumerable<T> sequence)
    {
        T item = default(T); // value never used
        int length = 0;
        foreach (var nextItem in sequence)
        {
            if (length == 0) // first item
            {
                item = nextItem;
                length = 1;
            }
            else if (EqualityComparer<T>.Default.Equals(item, nextItem)) // continuing run (comparison is null-safe)
            {
                length++;
            }
            else // run boundary
            {
                var run = Tuple.Create(item, length);
                item = nextItem;
                length = 1;
                yield return run;
            }
        }
        if (length > 0) // last run
        {
            yield return Tuple.Create(item, length);
        }
    }

There is no need to create lots of intermediate objects. You just need to keep track of the character in the longest sequence and the length of that sequence:

char longest = '\0';
int longestLength = 0;

char last = '\0';
int lastLength = 0;

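foreach (char c in input)
{
    if (lastLength > 0 && c == last)
    {
        lastLength++;
    }
    else
    {
        lastLength = 1; // start a new run
    }

    // Compare after every character so that runs of length 1 are counted too
    // (otherwise "abc" or a single-character input would yield an empty result).
    if (lastLength > longestLength)
    {
        longestLength = lastLength;
        longest = c;
    }

    last = c;
}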

var result = new string(longest, longestLength);
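
One caveat: a loop over char compares UTF-16 code units, so each 💈 is seen as two different code units and a run of them is never detected. As a hedged sketch, the same single-pass idea can be applied to Unicode scalar values with System.Text.Rune, assuming .NET Core 3.0 or later. The class name LongestRun and the method name Find are hypothetical.

```csharp
using System;
using System.Text;

public static class LongestRun
{
    // Returns the first longest run of identical Unicode scalar values,
    // or an empty string for empty input.
    public static string Find(string input)
    {
        Rune longest = default;
        int longestLength = 0;
        Rune last = default;
        int lastLength = 0;

        // EnumerateRunes walks the string one scalar value at a time,
        // combining surrogate pairs into a single Rune.
        foreach (Rune r in input.EnumerateRunes())
        {
            lastLength = (lastLength > 0 && r == last) ? lastLength + 1 : 1;
            last = r;

            // Strict comparison keeps the earliest run when two runs tie.
            if (lastLength > longestLength)
            {
                longestLength = lastLength;
                longest = r;
            }
        }

        var builder = new StringBuilder();
        for (int i = 0; i < longestLength; i++)
        {
            builder.Append(longest.ToString());
        }
        return builder.ToString();
    }
}
```

With the earlier test string, LongestRun.Find("aab💈💈💈ccc💈💈") returns "💈💈💈": the runs 💈💈💈 and ccc both have length three, and the earlier one wins.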

