Efficient way to split CSV files in C#

I am trying to split a large telecom bill, which arrives as a 300MB CSV file, into smaller chunks based on the phone number in the bill.

Some phone numbers have bills of 20 lines and some have more than 1000 lines, so the chunk size is dynamic. In a first pass I read the bill, use LINQ to group the lines by phone number, and count how many lines the CSV file contains for each phone number. I then insert into a List: split_id, starting line, ending line (the starting line counts from 0).

The script below is what I use to split the smaller bills. But this 300MB file contains an unusually high 7500+ phone numbers, and even though each output file ends up under 100KB, splitting the bill takes forever.

    static void FileSplitWriter(List<SplitFile> pList, string info)
    {

        pList.ForEach(delegate(SplitFile per)
        {
            int startingLine = per.startingLine;
            int endingLine = per.endingLine;
            string[] fileContents = File.ReadAllLines(info);
            var query = fileContents.Skip(startingLine - 1).Take(endingLine - (startingLine - 1));
            string directoryPath = Path.GetDirectoryName(info);
            string filenameok = Path.GetFileNameWithoutExtension(info);

            StreamWriter ffs = new StreamWriter(directoryPath + "\\" + filenameok + "_split" + per.id + ".csv");
            foreach (string line in query)
            {
                ffs.WriteLine(line);
            }
            ffs.Dispose();
            ffs.Close();
        });


    }

My question is: is it possible to make this process much faster and more efficient? At the current rate it will take 3 hours or so just to split the file.

It looks like the most inefficient part of this code is that you are reading the entire 300MB file into memory multiple times: File.ReadAllLines is called inside the loop, once per split. You should only need to read it once:

  1. Read the file into some enumerable data structure.
  2. Group by phone number.
  3. Loop over each group and write each to a file.
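
The three steps above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the `BillSplitter` class and the `PhoneNumberOf` helper are hypothetical, and the sketch assumes the phone number is the first comma-separated field, that no field contains quoted commas, that there is no header row, and that the phone number is safe to use in a file name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class BillSplitter
{
    // Hypothetical helper. ASSUMPTION: the phone number is the first
    // comma-separated field and fields never contain quoted commas.
    static string PhoneNumberOf(string line)
    {
        return line.Split(',')[0];
    }

    // Reads the input once and writes one CSV file per phone number.
    // Returns the number of files written.
    public static int SplitByPhoneNumber(string inputPath, string outputDirectory)
    {
        string baseName = Path.GetFileNameWithoutExtension(inputPath);
        int filesWritten = 0;

        // Steps 1 and 2: read the file once and group its lines by phone number.
        // GroupBy buffers every line in memory before yielding the first group.
        var groups = File.ReadLines(inputPath).GroupBy(line => PhoneNumberOf(line));

        // Step 3: write each group to its own file.
        foreach (var group in groups)
        {
            string outputPath = Path.Combine(
                outputDirectory, baseName + "_split" + group.Key + ".csv");
            File.WriteAllLines(outputPath, group);
            filesWritten++;
        }

        return filesWritten;
    }
}
```

Because the grouping happens in the same pass as the read, the separate first pass that computes starting and ending line numbers would no longer be needed under these assumptions.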

Note: if you're using .NET 4.0 or later, you might gain some memory efficiency by using File.ReadLines() instead of ReadAllLines(), since it yields lines lazily rather than loading the whole file into an array.
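
As a rough sketch of how File.ReadLines() could be combined with the question's existing list of line ranges, the code below walks the file once with a single enumerator and hands each range its lines as they stream past. The `SplitFile` class here is an assumed shape of the asker's type (only `id`, `startingLine` and `endingLine` appear in the question; their types are guesses), `StreamingSplitter` is a hypothetical name, and the ranges are assumed to be 0-based, inclusive and non-overlapping.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// ASSUMPTION: stand-in for the question's SplitFile type; the field types are guesses.
class SplitFile
{
    public int id;
    public int startingLine; // 0-based, inclusive
    public int endingLine;   // 0-based, inclusive
}

static class StreamingSplitter
{
    public static void Split(List<SplitFile> ranges, string inputPath)
    {
        string directoryPath = Path.GetDirectoryName(inputPath);
        string baseName = Path.GetFileNameWithoutExtension(inputPath);

        // One enumerator is shared by all ranges, so the input is read exactly once
        // and only the current line is held in memory.
        using (IEnumerator<string> lines = File.ReadLines(inputPath).GetEnumerator())
        {
            int lineNumber = 0; // index of the next line to be read

            foreach (SplitFile range in ranges.OrderBy(r => r.startingLine))
            {
                // Skip any lines that fall before this range.
                while (lineNumber < range.startingLine && lines.MoveNext())
                {
                    lineNumber++;
                }

                string outputPath = Path.Combine(
                    directoryPath, baseName + "_split" + range.id + ".csv");

                using (var writer = new StreamWriter(outputPath))
                {
                    // Copy lines until the end of the range or the end of the file.
                    while (lineNumber <= range.endingLine && lines.MoveNext())
                    {
                        writer.WriteLine(lines.Current);
                        lineNumber++;
                    }
                }
            }
        }
    }
}
```

Sorting the ranges by starting line is what allows a single forward pass; if the ranges could overlap, this approach would not work as written.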

I suggest you use one of the many fast CSV parsing libraries that exist.

There are several posted on CodeProject and elsewhere, as well as FileHelpers.

Try moving the read of the file outside the loop:

    static void FileSplitWriter(List<SplitFile> pList, string info)
    {
        // Read the input once, outside the loop, instead of once per split.
        string[] fileContents = File.ReadAllLines(info);
        string directoryPath = Path.GetDirectoryName(info);
        string filenameok = Path.GetFileNameWithoutExtension(info);

        pList.ForEach(delegate(SplitFile per)
        {
            int startingLine = per.startingLine;
            int endingLine = per.endingLine;
            var query = fileContents.Skip(startingLine - 1).Take(endingLine - (startingLine - 1));

            // The using block closes and disposes the writer, even if a write throws.
            using (StreamWriter ffs = new StreamWriter(directoryPath + "\\" + filenameok + "_split" + per.id + ".csv"))
            {
                foreach (string line in query)
                {
                    ffs.WriteLine(line);
                }
            }
        });
    }
