
Parsing a big CSV file in C# (.NET 4)

I know this question has been asked before, but I can't seem to get it working with the answers I've read. I've got a CSV file of ~1.2 GB. If I run the process as 32-bit I get an OutOfMemoryException; it works if I run it as a 64-bit process, but it still takes 3.4 GB of memory. I do know that I'm storing a lot of data in my CustomData class, but still, 3.4 GB of RAM? Am I doing something wrong when reading the file? dict is a dictionary that just maps a column index to the property the value should be saved in. Am I reading the file the right way?

using (StreamReader reader = new StreamReader(File.OpenRead(path)))
{
    while (!reader.EndOfStream)
    {
        String line = reader.ReadLine();
        String[] values = line.Split(';');
        CustomData data = new CustomData();
        string value;
        for (int i = 0; i < values.Length; i++)
        {
            // dict maps the column index to the name of the CustomData
            // property that should receive this column's value
            dict.TryGetValue(i, out value);
            Type targetType = data.GetType();
            PropertyInfo prop = targetType.GetProperty(value);
            // Split never yields null elements, so test for empty fields
            // and store the sentinel "NULL" for those
            if (String.IsNullOrEmpty(values[i]))
            {
                prop.SetValue(data, "NULL", null);
            }
            else
            {
                prop.SetValue(data, values[i], null);
            }
        }
        dataList.Add(data);
    }
}

There doesn't seem to be anything wrong with your usage of the stream reader: you read one line into memory, then move on and let it be collected.

However, in C# strings are encoded in memory as UTF-16, so on average each character consumes 2 bytes.

If your CSV also contains a lot of empty fields that you convert to "NULL", note that the string literal is interned, so each empty field only stores a reference (4 bytes on x86, 8 bytes on x64) to the same shared "NULL" instance, not a new string.

So on the whole, since you keep essentially all the data from the file in memory, it's not really surprising that you need almost three times the size of the file: 1.2 GB of mostly-ASCII text becomes roughly 2.4 GB of UTF-16 character data, and per-string and per-CustomData object overhead accounts for the rest of the 3.4 GB.

The actual solution is to parse your data in chunks of N lines, process each chunk, and release it so the GC can reclaim the memory before you read the next one.
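A minimal sketch of that chunked approach, assuming a ChunkSize that fits comfortably in memory; ParseLine stands for your existing per-line parsing, and ProcessChunk is a hypothetical placeholder for whatever you do with each batch (aggregate, write to a database, ...):

    const int ChunkSize = 10000;
    List<CustomData> chunk = new List<CustomData>(ChunkSize);

    using (StreamReader reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            chunk.Add(ParseLine(line));      // your existing per-line logic
            if (chunk.Count == ChunkSize)
            {
                ProcessChunk(chunk);         // hypothetical: aggregate, write out...
                chunk.Clear();               // let the GC reclaim those rows
            }
        }
        if (chunk.Count > 0)
            ProcessChunk(chunk);             // flush the final partial chunk
    }

This only helps if you can process the file incrementally; if you truly need every row in memory at once, chunking won't save you and you have to shrink the per-row representation instead.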

Note: Consider using a real CSV parser. There is more to CSV than just commas or semicolons: what if one of your fields contains a semicolon, a newline, or a quote?
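.NET 4 already ships with a simple one, Microsoft.VisualBasic.FileIO.TextFieldParser (add a reference to Microsoft.VisualBasic.dll; it works fine from C#). A minimal sketch with your semicolon delimiter:

    using Microsoft.VisualBasic.FileIO;   // reference Microsoft.VisualBasic.dll

    using (TextFieldParser parser = new TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(";");
        parser.HasFieldsEnclosedInQuotes = true;  // copes with quoted fields
                                                  // containing ; or newlines
        while (!parser.EndOfData)
        {
            string[] values = parser.ReadFields(); // one parsed record
            // map values[i] onto CustomData as before
        }
    }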

Edit

Actually, each string takes up to 20 + (N/2)*4 bytes in memory on x86 (e.g. a 10-character field costs 20 + 5*4 = 40 bytes, not just 20); see C# in Depth.

OK, a couple of points here.

  • As pointed out in the comments, a 32-bit .NET process can realistically use only about 1.5 GB, so consider that your maximum memory under x86.

  • The StreamReader itself has some overhead, but it does not cache the entire file: it reads through a small internal buffer (roughly a kilobyte by default), so the reader is not where your memory is going. Reading and processing the file in chunks, as suggested above, is still the way to cap peak usage.

  • The CustomData class: how many fields does it have, and how many instances are created? Each object reference takes 4 bytes on x86 and 8 bytes on x64, on top of a per-object header (8 bytes on x86, 16 bytes on x64). So a CustomData class with 10 fields of type System.Object costs roughly 96 bytes per instance on x64 (16-byte header plus 10*8 bytes of references) before it stores any data, and you create one per CSV line.

  • The dataList.Add at the end. I assume you are adding to a generic List<T>? If so, note that List<T> uses a doubling algorithm to resize: when the backing array is full, it allocates a new array of twice the size and copies everything over. So if the backing array holds 1 GB and needs one more slot, you momentarily need 3 GB (the old 1 GB array plus the new 2 GB one) just to perform the resize. The alternative is to pre-size the list, or use a pre-sized array; see the sketch below.
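A rough sketch of pre-sizing; the bytes-per-line figure is an assumption, so measure the real average line length of your data instead:

    // Estimate the row count from the file size. ~100 bytes per line is
    // an assumed figure for illustration; measure your actual data.
    long fileBytes = new FileInfo(path).Length;
    int estimatedRows = (int)(fileBytes / 100);

    List<CustomData> dataList = new List<CustomData>(estimatedRows);

Even a loose overestimate avoids the double-and-copy spikes; you can call dataList.TrimExcess() at the end if the estimate was far off.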
