
Find the most frequent character in a C# string, returning the longest set of recurring characters as a string

I have an extension to a question already asked here.

However, I want to return the longest set of recurring characters in the original string, not a list of chars and their relative counts ordered from highest to lowest.

I am fairly well versed in LINQ, but had never come across an instance of querying char types in a string, and thought someone could give me a hint to help me understand this specific use case of LINQ...

Thanks

Using the linked example:

var largest = input.GroupBy(x => x).OrderByDescending(x => x.Count()).First();
var asString = new string(largest.Key, largest.Count());

I'm assuming that you want the longest substring of identical characters. For example, for aab💈💈💈ccc💈💈 you want 💈💈💈

I also assume the problem domain is strings of Unicode characters. Unfortunately, .NET's System.String is a sequence of UTF-16 codeunits. To count or index Unicode characters, you have to deal with them as codepoints. The easiest way to do that is to change the encoding to UTF-32, since there is then one int per codepoint, and a codepoint is (generally speaking) a numeric identifier for a Unicode character.
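As a small illustration of the difference (a sketch, not part of the original answer; the class name CodeUnitDemo and its method names are made up for this example), the barber pole character occupies two codeunits but is a single codepoint:

```csharp
using System;
using System.Text;

public static class CodeUnitDemo
{
    // Number of UTF-16 codeunits, which is what String.Length reports.
    public static int CountCodeUnits(string s)
    {
        return s.Length;
    }

    // Number of codepoints: UTF-32 uses exactly four bytes per codepoint.
    // Assumes s is well-formed UTF-16 (no unpaired surrogates).
    public static int CountCodepoints(string s)
    {
        return Encoding.UTF32.GetByteCount(s) / 4;
    }
}
```

For the one-character string "💈", CountCodeUnits gives 2 while CountCodepoints gives 1, which is why indexing the string by char does not line up with what a reader sees as characters.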

After that, to find the longest subsequence of identical characters, you have to run through the whole sequence. Run-length encoding is a generalized method that I'm using as an intermediate step. After finding the codepoint and length of the longest subsequence, I recreate a string from them.

        const string test = "aab💈💈💈ccc💈💈"; // contains barber pole characters
        Console.WriteLine(test);

        var longest = test.ToCodepoints().RunLengthEncode().OrderByDescending(itemCount => itemCount.Item2).First();
        var subsequence = String.Concat(Enumerable.Repeat(Char.ConvertFromUtf32(longest.Item1), longest.Item2));
        Console.WriteLine(subsequence);

Converting a string to codepoints is equivalent to converting it to UTF-32. It can be done with a System.Text.Encoding method, but then you end up with an array of bytes that must in turn be converted to codepoints. Here is an extension method that instead yields the codepoints directly as an IEnumerable of int.

    public static IEnumerable<int> ToCodepoints(this String s)
    {
        var codeunits = s.ToCharArray();
        var i = 0;

        while (i < codeunits.Length)
        {
            int codepoint;
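            // A surrogate pair is two codeunits that together encode one codepoint.
            // The bounds check avoids reading past the end on a trailing unpaired surrogate.
            if (i + 1 < codeunits.Length && Char.IsSurrogatePair(codeunits[i], codeunits[i + 1]))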
            {
                codepoint = Char.ConvertToUtf32(codeunits[i], codeunits[i + 1]);
                i += 2;
            }
            else
            {
                codepoint = codeunits[i];
                i += 1;
            }
            yield return codepoint;
        }

    }
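
For comparison, the System.Text.Encoding route mentioned above might look roughly like this. It is only a sketch: Utf32Sketch and ToCodepointsViaEncoding are names invented for this example, and the little-endian byte order is a choice made here rather than something the original answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf32Sketch
{
    // Hypothetical alternative to ToCodepoints: encode to UTF-32, then read
    // one 32-bit integer per codepoint from the resulting byte array.
    public static IEnumerable<int> ToCodepointsViaEncoding(string s)
    {
        // Little-endian UTF-32 without a byte order mark.
        var encoding = new UTF32Encoding(bigEndian: false, byteOrderMark: false);
        byte[] bytes = encoding.GetBytes(s);

        for (int i = 0; i < bytes.Length; i += 4)
        {
            // Assemble the little-endian bytes by hand so the result does not
            // depend on the byte order of the machine running the code.
            yield return bytes[i]
                | (bytes[i + 1] << 8)
                | (bytes[i + 2] << 16)
                | (bytes[i + 3] << 24);
        }
    }
}
```

This allocates a byte array four times the number of codepoints, which is the extra conversion step the answer avoids by walking the codeunits directly.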

Run-length encoding produces a Tuple of the codepoint ( Item1 ) and the length of the run ( Item2 ) for each subsequence of identical codepoints:

    public static IEnumerable<Tuple<T, int>> RunLengthEncode<T>(this IEnumerable<T> sequence)
    {
        T item = default(T); // value never used
        int length = 0;
        foreach (var nextItem in sequence)
        {
            if (length == 0) // first item
            {
                item = nextItem;
                length = 1;
            }
            else if (EqualityComparer<T>.Default.Equals(item, nextItem)) // continuing run (null-safe comparison)
            {
                length++;
            }
            else // run boundary
            {
                var run = Tuple.Create(item, length);
                item = nextItem;
                length = 1;
                yield return run;
            }
        }
        if (length > 0) // last run
        {
            yield return Tuple.Create(item, length);
        }
    }

There is no need to create lots of intermediate objects. You just need to keep track of the character in the longest sequence and the length of that sequence:

char longest = '\0';
int longestLength = 0;

char last = '\0';
int lastLength = 0;

foreach (char c in input)
{
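    if (c == last)
    {
        lastLength++;
    }
    else
    {
        lastLength = 1;
    }

    // Compare after every character so that a run of length 1 also counts
    // (otherwise input with no repeats, such as "abc", yields an empty result).
    if (lastLength > longestLength)
    {
        longestLength = lastLength;
        longest = c;
    }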

    last = c;
}

var result = new string(longest, longestLength);
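
Note that the loop above compares char values, that is, UTF-16 codeunits, so a character outside the Basic Multilingual Plane (such as 💈 in the other answer's example) is never seen as a repeat. A sketch of the same single-pass idea applied to codepoints follows, assuming well-formed input; LongestRunSketch and LongestRun are names invented here, not part of either answer.

```csharp
using System;
using System.Text;

public static class LongestRunSketch
{
    public static string LongestRun(string input)
    {
        int longest = 0, longestLength = 0;
        int last = -1, lastLength = 0; // -1 is not a valid codepoint

        for (int i = 0; i < input.Length; )
        {
            // Assumes well-formed UTF-16; ConvertToUtf32 throws on an unpaired surrogate.
            int codepoint = Char.ConvertToUtf32(input, i);
            i += Char.IsSurrogatePair(input, i) ? 2 : 1;

            lastLength = (codepoint == last) ? lastLength + 1 : 1;
            last = codepoint;

            if (lastLength > longestLength) // strict, so the first of equal runs wins
            {
                longestLength = lastLength;
                longest = codepoint;
            }
        }

        // Rebuild the run; a codepoint may expand to two codeunits.
        var result = new StringBuilder();
        for (int n = 0; n < longestLength; n++)
        {
            result.Append(Char.ConvertFromUtf32(longest));
        }
        return result.ToString();
    }
}
```

With the earlier test string, LongestRunSketch.LongestRun("aab💈💈💈ccc💈💈") returns 💈💈💈, since the later run ccc has the same length but does not exceed it.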
