Linq Optimization for Count And Group By

I've written code that counts the frequency of each byte in a binary file using Linq. The code seems slow when it evaluates the Linq expression, and it seems hard to apply parallelism to this kind of logic. Building the frequency table for a file of about 475MB took approximately 1 minute.

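using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program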
{
    static void Main(string[] args)
    {
        Dictionary<byte, int> freq = new Dictionary<byte, int>();
        Stopwatch sw = new Stopwatch();


        sw.Start();
        //File Size 478.668 KB
        byte[] ltext = File.ReadAllBytes(@"D:\Setup.exe");
        sw.Stop();

        Console.WriteLine("Reading File {0}", GetTime(sw));




        // Restart so this measurement does not also include the file read above
        sw.Restart();
        Dictionary<byte, int> result = (from i in ltext
                                     group i by i into g
                                     orderby g.Count() descending
                                     select new { Key = g.Key, Freq = g.Count() })
                                    .ToDictionary(x => x.Key, x => x.Freq);
        sw.Stop();
        Console.WriteLine("Generating Freq Table {0}", GetTime(sw));


        foreach (var i in result)
        {
            Console.WriteLine(i);
        }
        Console.WriteLine(result.Count);
        Console.ReadLine();
    }

    static string GetTime(Stopwatch sw)
    {
        TimeSpan ts = sw.Elapsed;
        string elapsedTime = String.Format("{0} min {1} sec {2} ms",ts.Minutes, ts.Seconds, ts.Milliseconds);
        return elapsedTime;
    }
}

I've also tried a non-Linq solution using a few loops, and the performance is about the same. Any advice on how to optimize this would be appreciated. Sorry for my bad English.

This took a bit over a second on a 442MB file on my otherwise poky Dell laptop:

        byte[] ltext = File.ReadAllBytes(@"c:\temp\bigfile.bin");
        var freq = new long[256];
        var sw = Stopwatch.StartNew();
        foreach (byte b in ltext) {
            freq[b]++;
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);

Very hard to beat the raw perf of an array.
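The question mentions that parallelism seemed hard to apply here. With the array approach, one possible way is to give each worker its own 256-slot array and add the arrays together at the end, so the hot loop needs no locking. The following is only a sketch under that assumption: `ParallelByteHistogram` and `Count` are hypothetical names, not part of the code above, and since the loop is mostly limited by memory bandwidth, whether it is actually faster has to be measured on the target machine.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class ParallelByteHistogram
{
    public static long[] Count(byte[] data)
    {
        var total = new long[256];
        if (data.Length == 0)
            return total; // Partitioner.Create rejects an empty range

        var mergeLock = new object();

        Parallel.ForEach(
            // Split the index range into contiguous chunks
            Partitioner.Create(0, data.Length),
            // Each worker starts with its own private counters
            () => new long[256],
            (range, loopState, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    local[data[i]]++;
                return local;
            },
            // Merge each worker's counters into the shared result once it is done
            local =>
            {
                lock (mergeLock)
                {
                    for (int b = 0; b < 256; b++)
                        total[b] += local[b];
                }
            });

        return total;
    }
}
```

It would be called as `long[] freq = ParallelByteHistogram.Count(ltext);`. The lock is taken only once per worker during the merge, never inside the counting loop.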

The following displays the frequency of bytes in descending order for a 465MB file on my machine in under 9 seconds when built in release mode.

Note, I've made it faster by reading the file in 100000 byte blocks (you can experiment with this - 16K blocks made no appreciable difference on my machine). The point is that the inner loop is the one supplying bytes. Calling Stream.ReadByte() is fast but not nearly as fast as indexing a byte in an array.

Also, reading the whole file into memory exerts extreme memory pressure, which will hamper performance and will fail completely if the file is large enough.

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main( string[] args )
    {
        Console.WriteLine( "Reading file..." );
        var sw = Stopwatch.StartNew();
        var frequency = new long[ 256 ];
        using ( var input = File.OpenRead( @"c:\Temp\TestFile.dat" ) )
        {
            var buffer = new byte[ 100000 ];
            int bytesRead;
            do
            {
                bytesRead = input.Read( buffer, 0, buffer.Length );
                for ( var i = 0; i < bytesRead; i++ )
                    frequency[ buffer[ i ] ]++;
            } while ( bytesRead > 0 ); // Read may return fewer bytes than requested; only 0 means end of file
        }
        Console.WriteLine( "Read file in " + sw.ElapsedMilliseconds + "ms" );

        var result = frequency.Select( ( f, i ) => new ByteFrequency { Byte = i, Frequency = f } )
            .OrderByDescending( x => x.Frequency );
        foreach ( var byteCount in result )
            Console.WriteLine( byteCount.Byte + " " + byteCount.Frequency );
    }

    public class ByteFrequency
    {
        public int Byte { get; set; }
        public long Frequency { get; set; }
    }
}

Why not just

int[] freq = new int[256];
foreach (byte b in ltext)
    freq[b]++;

?
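If the ordered table from the question is still wanted, Linq can be applied afterwards to the 256 counters instead of to every byte of the file. A minimal sketch, assuming a hypothetical helper named `FrequencyTable.Build` (not from the question or the answers): it returns a list rather than a `Dictionary<byte, int>`, because a dictionary does not guarantee enumeration order, and it drops byte values that never occur, which mirrors what the original `group by` would produce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class FrequencyTable
{
    public static List<KeyValuePair<byte, int>> Build(byte[] data)
    {
        // Count with a plain array, as in the loop above
        var freq = new int[256];
        foreach (byte b in data)
            freq[b]++;

        // Linq now only touches 256 items, so its overhead is negligible
        return freq
            .Select((count, value) => new KeyValuePair<byte, int>((byte)value, count))
            .Where(pair => pair.Value > 0)          // skip bytes that never occur
            .OrderByDescending(pair => pair.Value)  // most frequent first
            .ToList();
    }
}
```

It would be called as `var result = FrequencyTable.Build(ltext);`, after which `result` can be printed with the same `foreach` as in the question.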
