
How to optimize memory usage in this algorithm?

I'm developing a log parser, and I'm reading files of more than 150 MB of strings. This is my approach; is there any way to optimize what is in the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and faced the same memory consumption.

private void ReadLogInThread()
{
    string lineOfLog = string.Empty;

    try
    {
        StreamReader logFile = new StreamReader(myLog.logFileLocation);
        InformationUnit infoUnit = new InformationUnit();

        infoUnit.LogCompleteSize = myLog.logFileSize;

        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            myLog.transformedLog.Add(lineOfLog); // List<string>
            myLog.logNumberLines++;

            infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
            infoUnit.CurrentLine = lineOfLog;
            infoUnit.CurrentSizeRead += lineOfLog.Length;

            if (onLineRead != null)
                onLineRead(infoUnit);
        }
    }
    catch { throw; }
}

Thanks in advance!

EXTRA: I'm saving each line because, after reading the log, I will need to check some information on every stored line. The language is C#.

Memory savings can be achieved if your log lines can actually be parsed into a data-row representation.

Here is a typical log line I can think of:

Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success

This line takes about 200 bytes in memory. At the same time, the following representation takes below 16 bytes:

enum LogReason : short { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}
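For illustration, a parser for this shape might look like the sketch below. The field order, separators, and timestamp format are assumptions taken from the single sample line above, and the enum and struct definitions are repeated so the snippet is self-contained; a real log format would need a more defensive parser.

```csharp
using System;
using System.Globalization;

enum LogReason : short { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}

static class LogRowParser
{
    // Parses a line shaped exactly like the sample above, e.g.
    // "Event at: 2019/01/05:0:24:32.435, Reason: Operation, ..."
    public static LogRow Parse(string line)
    {
        string[] parts = line.Split(',');
        return new LogRow
        {
            EventTime = DateTime.ParseExact(
                parts[0].Substring("Event at: ".Length).Trim(),
                "yyyy/MM/dd:H:mm:ss.fff", CultureInfo.InvariantCulture),
            Reason = (LogReason)Enum.Parse(typeof(LogReason), ValueOf(parts[1])),
            Kind = (EventKind)Enum.Parse(typeof(EventKind), ValueOf(parts[2])),
            Status = (OperationStatus)Enum.Parse(typeof(OperationStatus), ValueOf(parts[3]))
        };
    }

    // "Reason: Operation" -> "Operation"
    static string ValueOf(string part) => part.Split(':')[1].Trim();
}
```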

Another optimization possibility is parsing a line into an array of string tokens; this way you could make use of string interning. For example, if the word "DataStoreOperation" takes 36 bytes and has 1,000,000 occurrences in the file, the saving is (18*2 - 4) * 1,000,000 = 32,000,000 bytes.
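As a concrete illustration of interning, string.Intern maps equal strings built at runtime onto one shared instance, so you pay for the characters only once:

```csharp
using System;

class InternDemo
{
    static void Main()
    {
        // Two equal strings built at runtime are normally distinct objects...
        string a = new string("DataStoreOperation".ToCharArray());
        string b = new string("DataStoreOperation".ToCharArray());
        Console.WriteLine(ReferenceEquals(a, b));   // False

        // ...but interning returns the single shared instance for both.
        string ia = string.Intern(a);
        string ib = string.Intern(b);
        Console.WriteLine(ReferenceEquals(ia, ib)); // True
    }
}
```

Note that interned strings live for the lifetime of the process, so this suits a bounded vocabulary of tokens, not arbitrary line text.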

Try to make your algorithm sequential.

Using an IEnumerable instead of a List helps play nicely with memory while keeping the same semantics as working with a list, provided you don't need random access to lines by index.

IEnumerable<string> ReadLines()
{
    // open the reader as in the original code
    using (var logFile = new StreamReader(myLog.logFileLocation))
    {
        string lineOfLog;
        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            yield return lineOfLog;
        }
    }
}
//...
foreach( var line in ReadLines() )
{
  ProcessLine(line);
}

I am not sure if it will fit your project, but you can store the result in a StringBuilder instead of a list of strings.

For example, this process on my machine takes 250 MB of memory after loading (the file is 50 MB):

static void Main(string[] args)
{
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        var list = new List<string>();
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            list.Add(line);
        }
    }
}

On the other hand, this version will take only 100 MB:

static void Main(string[] args)
{
    var stringBuilder = new StringBuilder();
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            stringBuilder.AppendLine(line);
        }
    }
}

Memory usage keeps going up because you're simply adding the lines to a List<string>, which keeps growing. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in scope. Of course, this will greatly degrade speed.
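One possible sketch of the write-to-disk idea: spill the lines to a temporary file on the first pass, then stream them back lazily on the second pass. File.ReadLines (unlike File.ReadAllLines) keeps only one line in memory at a time; the temp-file handling here is illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class SpillToDisk
{
    // First pass: write lines to a temp file instead of accumulating a List<string>.
    public static string SpillLines(IEnumerable<string> lines)
    {
        string tempPath = Path.GetTempFileName();
        using (var writer = new StreamWriter(tempPath))
            foreach (var line in lines)
                writer.WriteLine(line);
        return tempPath;
    }

    static void Main()
    {
        string path = SpillLines(new[] { "line 1", "line 2" });

        // Second pass: re-read lazily; only one line is materialized at a time.
        foreach (var line in File.ReadLines(path))
            Console.WriteLine(line);

        File.Delete(path);
    }
}
```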

Another option is to compress the string data as you store it in your list and decompress it on the way out, but I don't think this is a good method.

Side Note:

You need to add a using block around your StreamReader:

using (StreamReader logFile = new StreamReader(myLog.logFileLocation))

Consider this implementation (I'm speaking C/C++; substitute C# as needed):

1) Use fseek/ftell to find the size of the file.

2) Use malloc to allocate a chunk of memory the size of the file + 1; set that last byte to '\0' to terminate the string.

3) Use fread to read the entire file into the memory buffer. You now have a char * which holds the contents of the file as a string.

4) Create a vector of const char * to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.

5) Find the line-ending characters (probably \r\n). Replace the \r with \0 to make the line a string, then increment past the \n. Push this new pointer location onto the vector.

6) Repeat the above until all of the lines in the file have been NUL-terminated and are pointed to by elements in the vector.

7) Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.

8) When you are done, close the file, free the memory, and continue happily along your way.
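Translated to C#, as the answer suggests, the same idea becomes one big buffer plus a list of line-start offsets. This is only a sketch: GetLine still copies a line at the moment you inspect it, and the demo buffer stands in for File.ReadAllText(path).

```csharp
using System;
using System.Collections.Generic;

class IndexedBuffer
{
    // Record the offset at which each line starts instead of copying lines out.
    public static List<int> IndexLineStarts(string buffer)
    {
        var starts = new List<int> { 0 };
        for (int i = 0; i < buffer.Length; i++)
            if (buffer[i] == '\n' && i + 1 < buffer.Length)
                starts.Add(i + 1);
        return starts;
    }

    // Materialize a single line only at the moment it is inspected.
    public static string GetLine(string buffer, int start)
    {
        int end = buffer.IndexOf('\n', start);
        if (end < 0) end = buffer.Length;
        return buffer.Substring(start, end - start).TrimEnd('\r');
    }

    static void Main()
    {
        // One big allocation for the whole log, e.g. File.ReadAllText(path).
        string buffer = "first line\r\nsecond line\r\n";
        foreach (int start in IndexLineStarts(buffer))
            Console.WriteLine(GetLine(buffer, start));
    }
}
```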

1) Compress the strings before you store them (i.e. see System.IO.Compression and GZipStream). This would probably kill the performance of your program, though, since you'd have to decompress to read each line.

2) Remove any extra whitespace characters or common words you can do without. I.e., if you can still understand what the log is saying without words like "the, a, of...", remove them. Also, shorten any common words (i.e. change "error" to "err" and "warning" to "wrn"). This would slow down this step of the process but shouldn't affect the performance of the rest.
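Option 1) could be sketched as a GZipStream round trip per line; as noted above, the decompression cost is paid on every read, so this trades CPU for memory:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressedLines
{
    // Compress one line to a byte[] before storing it in the list.
    public static byte[] Compress(string line)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(line);
                gzip.Write(bytes, 0, bytes.Length);
            }
            return output.ToArray();
        }
    }

    // Decompress on the way back out; this cost is paid on every read.
    public static string Decompress(byte[] data)
    {
        using (var input = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        using (var reader = new StreamReader(input, Encoding.UTF8))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        string line = new string('x', 1000) + " some log text";
        byte[] stored = Compress(line);
        Console.WriteLine(stored.Length < line.Length); // repetitive text shrinks a lot
        Console.WriteLine(Decompress(stored) == line);
    }
}
```

Note that gzip pays off on long, repetitive lines; very short lines can actually grow because of the gzip header.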

What encoding is your original file? If it is ASCII, then the strings alone are going to take over 2x the size of the file just to load into your array. A C# character is 2 bytes, and a C# string adds an extra 20 bytes per string on top of the characters.

In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You most likely can parse the incoming line into a data structure, which reduces the memory overhead. For example, if you have a timestamp in the log file, you can convert it to a DateTime value, which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be turned into a code or an enum in a similar manner.

Even if you have to leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all, you can probably cut down on your memory usage. If there are a lot of common strings, you can intern them and only pay for one string no matter how many occurrences you have.

If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF-8 bytes internally. Strings are UTF-16 internally, so you're storing an extra byte for each character. By switching to UTF-8 you cut memory use roughly in half (not counting per-object class overhead, which is still significant). You can then convert back to normal strings as needed.

static void Main(string[] args)
{
    List<Byte[]> strings = new List<byte[]>();

    using (TextReader tr = new StreamReader(@"C:\test.log"))
    {
        string s = tr.ReadLine();
        while (s != null)
        {
            strings.Add(Encoding.UTF8.GetBytes(s)); // GetBytes already produces UTF-8
            s = tr.ReadLine();
        }
    }

    // Get strings back
    foreach( var str in strings)
    {
        Console.WriteLine(Encoding.UTF8.GetString(str));
    }
}
