
C# Dictionary and Efficient Memory Usage

I have a tool to compare 2 CSV files and then bucket each cell into one of 6 buckets. Basically, it reads in the CSV files (using the fast CSV reader, credit: http://www.codeproject.com/KB/database/CsvReader.aspx ) and then creates a dictionary for each file based on the key columns provided by the user. I then iterate through the dictionaries comparing the values and writing a result CSV file.

While it is blazing fast, it is very inefficient in terms of memory usage. I cannot compare files larger than about 150 MB on my box with 3 GB of physical memory.

Here is a code snippet to read the expected file. At the end of this piece, memory usage is close to 500 MB according to Task Manager.

// Read Expected
long rowNumExp;
System.IO.StreamReader readerStreamExp = new System.IO.StreamReader(@expFile);
SortedDictionary<string, string[]> dictExp = new SortedDictionary<string, string[]>();
List<string[]> listDupExp = new List<string[]>();
using (CsvReader readerCSVExp = new CsvReader(readerStreamExp, hasHeaders, 4096))
{
    readerCSVExp.SkipEmptyLines = false;
    readerCSVExp.DefaultParseErrorAction = ParseErrorAction.ThrowException;
    readerCSVExp.MissingFieldAction = MissingFieldAction.ParseError;
    fieldCountExp = readerCSVExp.FieldCount;                
    string keyExp;
    string[] rowExp = null;
    while (readerCSVExp.ReadNextRecord())
    {
        if (hasHeaders == true)
        {
            rowNumExp = readerCSVExp.CurrentRecordIndex + 2;
        }
        else
        {
            rowNumExp = readerCSVExp.CurrentRecordIndex + 1;
        }
        try
        {
            rowExp = new string[fieldCount + 1];                    
        }
        catch (Exception exExpOutOfMemory)
        {
            MessageBox.Show(exExpOutOfMemory.Message);
            Environment.Exit(1);
        }                
        keyExp = readerCSVExp[keyColumns[0] - 1];
        for (int i = 1; i < keyColumns.Length; i++)
        {
            keyExp = keyExp + "|" + readerCSVExp[i - 1];
        }
        try
        {
            readerCSVExp.CopyCurrentRecordTo(rowExp);
        }
        catch (Exception exExpCSVOutOfMemory)
        {
            MessageBox.Show(exExpCSVOutOfMemory.Message);
            Environment.Exit(1);
        }
        try
        {
            rowExp[fieldCount] = rowNumExp.ToString();
        }
        catch (Exception exExpRowNumOutOfMemory)
        {
            MessageBox.Show(exExpRowNumOutOfMemory.Message);
            Environment.Exit(1);
        }
        // Dedup Expected                        
        if (!(dictExp.ContainsKey(keyExp)))
        {
            dictExp.Add(keyExp, rowExp);                        
        }
        else
        {
            listDupExp.Add(rowExp);
        }                    
    }                
    logFile.WriteLine("Done Reading Expected File at " + DateTime.Now);
    Console.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
    logFile.WriteLine("Done Creating Expected Dictionary at " + DateTime.Now);
    logFile.WriteLine("Done Identifying Expected Duplicates at " + DateTime.Now + "\r\n");                
}

Is there anything I could do to make it more memory efficient? Anything I could do differently above to consume less memory?

Any ideas are welcome.

Thanks guys for all the feedback.

I have incorporated the suggested changes to store the row index instead of the row itself in the dictionaries.

Here is the same code fragment with the new implementation.

// Read Expected
        long rowNumExp;
        SortedDictionary<string, long> dictExp = new SortedDictionary<string, long>();
        System.Text.StringBuilder keyExp = new System.Text.StringBuilder();
        while (readerCSVExp.ReadNextRecord())
        {
            if (hasHeaders == true)
            {
                rowNumExp = readerCSVExp.CurrentRecordIndex + 2;
            }
            else
            {
                rowNumExp = readerCSVExp.CurrentRecordIndex + 1;
            }
            for (int i = 0; i < keyColumns.Length - 1; i++)
            {
                keyExp.Append(readerCSVExp[keyColumns[i] - 1]);
                keyExp.Append("|");
            }
            keyExp.Append(readerCSVExp[keyColumns[keyColumns.Length - 1] - 1]);
            // Dedup Expected                       
            if (!(dictExp.ContainsKey(keyExp.ToString())))
            {
                dictExp.Add(keyExp.ToString(), rowNumExp);
            }
            else
            {
                // Process Expected Duplicates          
                string dupExp;
                for (int i = 0; i < fieldCount; i++)
                {
                    if (i >= fieldCountExp)
                    {
                        dupExp = null;
                    }
                    else
                    {
                        dupExp = readerCSVExp[i];
                    }
                    foreach (int keyColumn in keyColumns)
                    {
                        if (i == keyColumn - 1)
                        {
                            resultCell = "duplicateEXP: '" + dupExp + "'";
                            resultCell = CreateCSVField(resultCell);
                            resultsFile.Write(resultCell);
                            comSumCol = comSumCol + 1;
                            countDuplicateExp = countDuplicateExp + 1;
                        }
                        else
                        {
                            if (checkPTColumns(i + 1, passthroughColumns) == false)
                            {
                                resultCell = "'" + dupExp + "'";
                                resultCell = CreateCSVField(resultCell);
                                resultsFile.Write(resultCell);
                                countDuplicateExp = countDuplicateExp + 1;
                            }
                            else
                            {
                                resultCell = "PASSTHROUGH duplicateEXP: '" + dupExp + "'";
                                resultCell = CreateCSVField(resultCell);
                                resultsFile.Write(resultCell);
                            }
                            comSumCol = comSumCol + 1;
                        }
                    }
                    if (comSumCol <= fieldCount)
                    {
                        resultsFile.Write(csComma);
                    }
                }
                if (comSumCol == fieldCount + 1)
                {
                    resultsFile.Write(csComma + rowNumExp);
                    comSumCol = comSumCol + 1;
                }
                if (comSumCol == fieldCount + 2)
                {
                    resultsFile.Write(csComma);
                    comSumCol = comSumCol + 1;
                }
                if (comSumCol > fieldCount + 2)
                {
                    comSumRow = comSumRow + 1;
                    resultsFile.Write(csCrLf);
                    comSumCol = 1;
                }
            }
            keyExp.Clear();
        }
        logFile.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
        Console.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
        logFile.WriteLine("Done Analyzing Expected Duplicates at " + DateTime.Now + "\r\n");
        Console.WriteLine("Done Analyzing Expected Duplicates at " + DateTime.Now + "\r\n");
        logFile.Flush();

However, the problem is that I need both data sets in memory. I actually iterate through both dictionaries looking for matches, mismatches, duplicates and dropouts based on the key.
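
To give an idea, here is a simplified sketch of that comparison pass; dictAct stands in for the dictionary built from the actual file, and the comments mark where the bucketing logic goes:

// Simplified sketch of the key comparison pass (dictAct is the actual-file dictionary).
static void CompareKeys(SortedDictionary<string, long> dictExp,
                        SortedDictionary<string, long> dictAct)
{
    foreach (KeyValuePair<string, long> exp in dictExp)
    {
        long actRowNum;
        if (dictAct.TryGetValue(exp.Key, out actRowNum))
        {
            // Match on key: re-read both rows (by row number or offset) and bucket each cell.
        }
        else
        {
            // Dropout: the key exists only in the expected file.
        }
    }
    foreach (KeyValuePair<string, long> act in dictAct)
    {
        if (!dictExp.ContainsKey(act.Key))
        {
            // Extra: the key exists only in the actual file.
        }
    }
}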

Using this approach of storing the row index, I am still using a lot of memory, because for dynamic access I now have to use the cached version of the CSV reader. So although the dictionary is much smaller, the caching of the data makes up for the savings and I still ended up with roughly the same memory usage.

Hope I am making sense... :)

One option is to get rid of the dictionaries entirely and just loop through the 2 files, but I am not sure whether the performance would be as fast as comparing 2 dictionaries.

Any inputs are much appreciated.

You could replace keyExp with a StringBuilder. Concatenating the string in a loop like that keeps allocating new strings, since strings are immutable.

StringBuilder keyExp = new StringBuilder();
...
    keyExp.Append("|" + readerCSVExp[i - 1]) ;
... 

Are a lot of the strings the same? You could try interning them; then any identical strings will share the same memory rather than being copies...

rowExp[fieldCount] = String.Intern(rowNumExp.ToString()); 

// Dedup Expected               
string internedKey = (String.Intern(keyExp.ToString()));        
if (!(dictExp.ContainsKey(internedKey)))
{
   dictExp.Add(internedKey, rowExp);                        
}
else
{
   listDupExp.Add(rowExp);
}  

I'm not certain exactly how the code works, but beyond that I'd say you don't need to keep rowExp in the dictionary. Keep something else, like a number, and write rowExp back out to disk in another file. This will probably save you the most memory, as rowExp seems to be an array of strings from the file, so it is probably big. If you write it to a file and keep track of where in that file it sits, you can get back to it again later if you need to process it. If you save that file offset as the value in the dictionary, you'd be able to find it again quickly. Maybe :).
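
Something along these lines, perhaps (just a sketch; the side-file name and the delimiter are made up, and the middle part sits inside your existing read loop):

// Sketch: append each row to a side file and keep only key -> byte offset in memory.
SortedDictionary<string, long> dictExp = new SortedDictionary<string, long>();
StreamWriter sideFile = new StreamWriter("expected.rows");    // hypothetical side file

// ... inside the existing read loop, once keyExp and rowExp have been built ...
sideFile.Flush();                                             // so BaseStream.Position is current
long offset = sideFile.BaseStream.Position;                   // where this row will start
sideFile.WriteLine(string.Join("\u0001", rowExp));            // use a delimiter that can't occur in the data
if (!dictExp.ContainsKey(keyExp))
{
    dictExp.Add(keyExp, offset);
}
// Later, to get a row back: open "expected.rows", Seek() to dictExp[key] and ReadLine().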

Tell me if I get anything wrong.

The code above reads one CSV file and looks for duplicate keys. Each row goes into one of two sets: one for rows with duplicate keys, and one for rows without.

What do you do with these rowsets?

Are they written to different files?

If so, there's no reason to store the non-unique rows in a list; as you find them, write them to a file.

When you do find duplicates, there's no need to store the entire row: just store the key, and write the row to a file (obviously a different file if you want to keep them separate).

If you need to do further processing on the different sets, then instead of storing the entire row, store just the row number. Then when you do whatever it is you do with the rows, you have the row number needed to fetch the row again.

NB: rather than storing a row number, you can store the offset of the start of the row in the file. Then you can access the file and read rows randomly, if you need to.
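
Something like this would do it, roughly (a sketch only; it assumes no quoted field contains an embedded line break, and the method names are just for illustration):

// Build an index of the byte offset at which each line of the CSV starts.
static List<long> IndexLineOffsets(string path)
{
    List<long> offsets = new List<long>();
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        long pos = 0;
        offsets.Add(0);                      // first line starts at offset 0
        int b;
        while ((b = fs.ReadByte()) != -1)    // byte-by-byte for clarity, not speed
        {
            pos++;
            if (b == '\n' && pos < fs.Length)
                offsets.Add(pos);            // next line starts right after the newline
        }
    }
    return offsets;
}

// Re-read a single line given its start offset.
static string ReadLineAt(string path, long offset)
{
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    using (StreamReader reader = new StreamReader(fs))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        reader.DiscardBufferedData();        // don't reuse bytes buffered before the seek
        return reader.ReadLine();
    }
}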

Just comment on this answer with any questions (or clarifications) you might have and I'll update the answer; I'll be here for another couple of hours anyway.

Edit
You can reduce the memory footprint further by not storing the keys themselves but storing hashes of the keys. If you find a possible duplicate, seek to that position in the file, re-read the row and compare the actual keys.
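
Roughly like this, assuming you track each row's file offset while reading (currentRowOffset below) and have some helper, here called BuildKeyFromRowAt, that seeks to an offset, re-parses that row and rebuilds its key; both names are made up for illustration:

// Store only the hash of each key; on a hash hit, confirm against the real key before
// calling it a duplicate (hashes of different keys can collide).
Dictionary<int, long> dictExpByHash = new Dictionary<int, long>();

int keyHash = keyExp.ToString().GetHashCode();
long earlierOffset;
if (!dictExpByHash.TryGetValue(keyHash, out earlierOffset))
{
    dictExpByHash.Add(keyHash, currentRowOffset);             // offset tracked while reading
}
else
{
    string earlierKey = BuildKeyFromRowAt(earlierOffset);     // hypothetical helper
    if (earlierKey == keyExp.ToString())
    {
        // Genuine duplicate key: handle as in the duplicate branch above.
    }
    else
    {
        // Different keys with the same hash: would need e.g. a list of offsets per hash value.
    }
}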

If you haven't already, get a profiler on this, like DotTrace, to see which objects are using the memory; that'll give you a good idea of what needs optimising.

Some ideas from looking at the code:

Do you need to store listDupExp? It seems to me that with that list you're effectively loading both files into memory, so 2 x 150 MB plus some overhead could easily approach 500 MB in Task Manager.

Secondly, can you start writing the output before you've read all the input? I presume this is tricky, as it looks like you need all the output items sorted before you write them out, but it may be something you could look at.
