Optimizing counting and grouping with LINQ

Question

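I have written code that counts the frequency of each byte in a binary file, using LINQ. The code seems to slow down when it executes the LINQ expression, and this kind of logic seems hard to parallelize. Building the frequency table for a file of more than 475 MB takes about 1 minute.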

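using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program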
{
    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();

        sw.Start();
        //File Size 478.668 KB
        byte[] ltext = File.ReadAllBytes(@"D:\Setup.exe");
        sw.Stop();

        Console.WriteLine("Reading File {0}", GetTime(sw));

        // Restart (not Start) so this measurement excludes the file-read time above.
        sw.Restart();
        Dictionary<byte, int> result = (from i in ltext
                                     group i by i into g
                                     orderby g.Count() descending
                                     select new { Key = g.Key, Freq = g.Count() })
                                    .ToDictionary(x => x.Key, x => x.Freq);
        sw.Stop();
        Console.WriteLine("Generating Freq Table {0}", GetTime(sw));


        foreach (var i in result)
        {
            Console.WriteLine(i);
        }
        Console.WriteLine(result.Count);
        Console.ReadLine();
    }

    static string GetTime(Stopwatch sw)
    {
        TimeSpan ts = sw.Elapsed;
        string elapsedTime = String.Format("{0} min {1} sec {2} ms",ts.Minutes, ts.Seconds, ts.Milliseconds);
        return elapsedTime;
    }
}

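I also tried a non-LINQ solution built from a few loops, and its performance was about the same. Please suggest any optimizations. Sorry for my poor English.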

Answer 1

This took very little time on a 442 MB file on my sluggish Dell laptop:

        byte[] ltext = File.ReadAllBytes(@"c:\temp\bigfile.bin");
        var freq = new long[256];
        var sw = Stopwatch.StartNew();
        foreach (byte b in ltext) {
            freq[b]++;
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);

It is hard to beat the raw performance of an array.
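The question mentions that this logic seems hard to parallelize. One possible approach, sketched here under assumptions and not taken from any of the answers, is to give each worker its own 256-entry array over a slice of the input and merge the arrays at the end. The class name `ParallelHistogram` and the method `Count` are hypothetical names chosen for illustration. Whether this beats the single-threaded loop depends on the machine, so it should be measured before being adopted.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
static class ParallelHistogram
{
    public static long[] Count(byte[] data)
    {
        var total = new long[256];
        // Partitioner.Create requires a non-empty range, so handle empty input first.
        if (data.Length == 0)
            return total;

        var gate = new object();
        Parallel.ForEach(
            Partitioner.Create(0, data.Length),   // contiguous index ranges
            () => new long[256],                  // one private histogram per worker
            (range, state, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    local[data[i]]++;
                return local;
            },
            local =>
            {
                // Merge each private histogram into the shared total under a lock.
                lock (gate)
                {
                    for (int v = 0; v < 256; v++)
                        total[v] += local[v];
                }
            });
        return total;
    }
}
```

Keeping a private array per worker avoids contention on shared counters inside the hot loop; the lock is only taken once per worker when merging.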

Answer 2

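Built in Release mode, the following displays the bytes of a 465 MB file in descending order of frequency in under 9 seconds on my machine.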

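Note that I sped it up by reading the file in blocks of 100000 bytes (you can experiment with this; 16K blocks made no noticeable difference on my machine). The key point is that the inner loop is the one that supplies the bytes. Calling Stream.ReadByte() is fast, but not nearly as fast as indexing a byte in an array.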

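Also, reading the whole file into memory puts extreme pressure on memory, which hurts performance, and it will fail outright if the file is large enough.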

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main( string[] args )
    {
        Console.WriteLine( "Reading file..." );
        var sw = Stopwatch.StartNew();
        var frequency = new long[ 256 ];
        using ( var input = File.OpenRead( @"c:\Temp\TestFile.dat" ) )
        {
            var buffer = new byte[ 100000 ];
            int bytesRead;
            do
            {
                bytesRead = input.Read( buffer, 0, buffer.Length );
                for ( var i = 0; i < bytesRead; i++ )
                    frequency[ buffer[ i ] ]++;
            } while ( bytesRead > 0 ); // Read may return fewer bytes than requested before end of file; only 0 signals the end
        }
        Console.WriteLine( "Read file in " + sw.ElapsedMilliseconds + "ms" );

        var result = frequency.Select( ( f, i ) => new ByteFrequency { Byte = i, Frequency = f } )
            .OrderByDescending( x => x.Frequency );
        foreach ( var byteCount in result )
            Console.WriteLine( byteCount.Byte + " " + byteCount.Frequency );
    }

    public class ByteFrequency
    {
        public int Byte { get; set; }
        public long Frequency { get; set; }
    }
}

Answer 3

Why not just

int[] freq = new int[256];
foreach (byte b in ltext)
    freq[b]++;

?
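The array loops above produce raw counts, while the query in the question also orders the entries by descending frequency. A minimal sketch of recovering that ordered table from the array follows; `FrequencyTable` and `Build` are hypothetical names chosen for illustration. It returns a list instead of a `Dictionary<byte, int>`, because the enumeration order of a dictionary is not guaranteed, and it skips byte values that never occur so the result contains the same keys a `group by` would produce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
static class FrequencyTable
{
    public static List<KeyValuePair<byte, int>> Build(byte[] data)
    {
        // Count with a plain array, as in the answers.
        var freq = new int[256];
        foreach (byte b in data)
            freq[b]++;

        // LINQ now runs over only 256 elements, so its overhead is negligible.
        return freq
            .Select((count, value) => new KeyValuePair<byte, int>((byte)value, count))
            .Where(pair => pair.Value > 0)           // drop byte values that never occur
            .OrderByDescending(pair => pair.Value)   // most frequent first
            .ToList();
    }
}
```

Usage would look like `foreach (var pair in FrequencyTable.Build(ltext)) Console.WriteLine(pair);`, with `ltext` being the byte array from the question.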

3 answers

Answer 1
2 accepted 2010-10-24 20:44:21

Answer 2
2 2010-10-24 21:38:32

Answer 3
1 2010-10-24 20:43:45