
Efficient way to split CSV files in C#

I am trying to split a large telecom bill, which comes as a 300 MB CSV file, into smaller chunks based on the phone number in the bill.

Some phone numbers have bills of 20 lines and some have more than 1000 lines, so it's dynamic. On the first pass I read the bill and use LINQ to group the lines by phone number, counting how many billing lines each phone number has in the CSV file. Then I insert each result into a List as split_id, starting line, ending line (starting line starts from 0).

The method below is what I use to write out the smaller split files. But this 300 MB bill has an unusually high 7500+ phone numbers, and even though each output file ends up under 100 KB, it takes forever to split the bill.

    static void FileSplitWriter(List<SplitFile> pList, string info)
    {

        pList.ForEach(delegate(SplitFile per)
        {
            int startingLine = per.startingLine;
            int endingLine = per.endingLine;
            string[] fileContents = File.ReadAllLines(info);
            var query = fileContents.Skip(startingLine - 1).Take(endingLine - (startingLine - 1));
            string directoryPath = Path.GetDirectoryName(info);
            string filenameok = Path.GetFileNameWithoutExtension(info);

            StreamWriter ffs = new StreamWriter(directoryPath + "\\" + filenameok + "_split" + per.id + ".csv");
            foreach (string line in query)
            {
                ffs.WriteLine(line);
            }
            ffs.Dispose();
            ffs.Close();
        });


    }

My question is: can this process be made much faster/more efficient? At the current rate it will take about 3 hours just to split the file.

It looks like the most inefficient part of this code is that you are reading the entire 300 MB file into memory multiple times. You should only need to read it once:

  1. Read the file into some enumerable data structure.
  2. Group by phone number.
  3. Loop over each group and write each to a file.

Note: if you're using .NET 4.0, you might gain some memory efficiency by using File.ReadLines() (instead of ReadAllLines).
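For example, a single-pass version of those steps might look something like the sketch below. It is only a sketch: it assumes the phone number is the first comma-separated field on every line and that there is no header row to carry over into each split file, so the key extraction and output naming would need to be adjusted to the actual bill layout.

    // Minimal single-pass sketch: read once, group by phone number, write one file per group.
    // Assumes no header row and the phone number in the first comma-separated field.
    // requires: using System.IO; using System.Linq;
    static void SplitByPhoneNumber(string info)
    {
        string directoryPath = Path.GetDirectoryName(info);
        string filenameok = Path.GetFileNameWithoutExtension(info);

        var groups = File.ReadLines(info)                      // lazy line-by-line read (.NET 4.0+)
                         .GroupBy(line => line.Split(',')[0]); // key = phone number field

        foreach (var group in groups)
        {
            string outFile = Path.Combine(directoryPath, filenameok + "_" + group.Key + ".csv");
            File.WriteAllLines(outFile, group);                 // one output file per phone number
        }
    }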

I suggest you use one of the many fast CSV parsing libraries that exist.

There are several posted on CodeProject and elsewhere, as well as FileHelpers.
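If you go with FileHelpers, the usual pattern is to describe the record layout with an attributed class and let the engine parse the file into typed objects. The sketch below is illustrative only; BillRecord and its fields are made-up placeholders, since the real bill columns aren't shown in the question.

    // Illustrative FileHelpers usage; the record layout here is a hypothetical placeholder.
    // requires: using FileHelpers;  (the FileHelpers NuGet package)
    [DelimitedRecord(",")]
    public class BillRecord
    {
        // Replace with the bill's actual columns, declared in file order.
        public string PhoneNumber;
        public string Date;
        public string Duration;
        public string Cost;
    }

    static BillRecord[] ReadBill(string path)
    {
        var engine = new FileHelperEngine<BillRecord>();
        return engine.ReadFile(path);   // typed, validated parsing instead of manual string handling
    }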

Try moving the read of the file to outside the loop:

    static void FileSplitWriter(List<SplitFile> pList, string info)
    {
        // Read the 300 MB file once, outside the loop.
        string[] fileContents = File.ReadAllLines(info);
        string directoryPath = Path.GetDirectoryName(info);
        string filenameok = Path.GetFileNameWithoutExtension(info);

        pList.ForEach(delegate(SplitFile per)
        {
            int startingLine = per.startingLine;
            int endingLine = per.endingLine;
            var query = fileContents.Skip(startingLine - 1).Take(endingLine - (startingLine - 1));

            StreamWriter ffs = new StreamWriter(directoryPath + "\\" + filenameok + "_split" + per.id + ".csv");
            foreach (string line in query)
            {
                ffs.WriteLine(line);
            }
            ffs.Close();
        });
    }
