How to read Excel worksheets and write to a file asynchronously?

I have been given a very large Excel workbook that contains 500+ sheets. Each sheet represents a store, and the rows contain transactions at that store. Every sheet has an identical layout. I have been asked to write a program that loops through each sheet, pulls specific transaction data, and writes everything to one gigantic CSV file.

I know that this kind of functionality is far better suited to a relational database, but I have been asked to work on this as is.

I have written a program that successfully parses the data and writes it. The problem is that it takes almost half an hour to finish writing the file when the data is read and written synchronously.

I would like to accomplish this task by reading and writing the data from each sheet asynchronously. In C#, I would prefer to use the Task Parallel Library for this, but I am open to other options.

I am thinking about spinning off the worker threads from a foreach loop, like so:

foreach( Worksheet ws in _excelApp.Worksheets)
{
    Parallel.Invoke(()=>ExportWorksheet(ws));
}

And then in the method (shortened for brevity):

private void ExportWorksheet(Worksheet ws)
{
     // fi (declared elsewhere) identifies the shared output file.
     using(FileStream fs = new FileStream(fi.FullName, FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
     {
         for(int row = 1; row < 300; row++)
         {
              for(int column = 1; column < 20; column++)
              {
                   // One COM call per cell, then one write per cell.
                   byte[] bytes = Encoding.ASCII.GetBytes(ws.Cells[row, column].Value.ToString() + ",");
                   fs.Write(bytes, 0, bytes.Length);
              }

              // End the CSV line.
              byte[] newline = Encoding.ASCII.GetBytes("\n");
              fs.Write(newline, 0, newline.Length);
         }
     }

}

This gives me strange results, of course.

Am I on the right track? Should I be using a different encoding? Is there a cleaner way to accomplish the async write? Are any threading rules being broken here?

All suggestions are welcome. Thanks for the help.

Instead of looping through the rows and columns, you'd better use the Value property of a range (for example the UsedRange of a Worksheet). For a multi-cell range this returns a two-dimensional array containing all the data, so the whole sheet is read in a single call rather than one call per cell. This increases reading performance by a factor of 1000.
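To illustrate the idea, here is a minimal sketch of writing such a two-dimensional array to CSV. The names RangeCsv and WriteCsv are made up for this example, and it assumes the values need no CSV quoting (no commas, quotes or line breaks inside a cell). It deliberately takes a plain object[,] so that it does not depend on the Excel interop assembly.

```csharp
using System;
using System.Globalization;
using System.IO;

static class RangeCsv
{
    // Writes a two-dimensional array (such as the one a multi-cell
    // range's Value property returns) to 'writer' as comma-separated lines.
    // Assumption: values contain no commas, quotes or line breaks.
    public static void WriteCsv(object[,] cells, TextWriter writer)
    {
        // Arrays coming from Excel are usually 1-based, so ask the array
        // for its bounds instead of assuming 0.
        int firstRow = cells.GetLowerBound(0), lastRow = cells.GetUpperBound(0);
        int firstCol = cells.GetLowerBound(1), lastCol = cells.GetUpperBound(1);

        for (int r = firstRow; r <= lastRow; r++)
        {
            for (int c = firstCol; c <= lastCol; c++)
            {
                if (c > firstCol) writer.Write(',');
                // Empty cells are null; Convert.ToString turns null into "".
                writer.Write(Convert.ToString(cells[r, c], CultureInfo.InvariantCulture));
            }
            writer.WriteLine();
        }
    }
}
```

With the interop types, the call would presumably look something like RangeCsv.WriteCsv((object[,])ws.UsedRange.Value, writer), with one writer shared across all sheets; note that a range consisting of a single cell returns a scalar rather than an array, so that case would need separate handling.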

As for the other part, I rewrote it in two versions, removing the Excel references:

        DateTime start = DateTime.Now;

        //using (FileStream fs = new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
        //{
        //    for (int row = 1; row < 3 * 1000; row++)
        //    {
        //        for (int column = 1; column < 3 * 1000; column++)
        //        {
        //            byte[] bytes = Encoding.ASCII.GetBytes(1.ToString() + ",");
        //            fs.Write(bytes, 0, bytes.Length);
        //        }

        //        byte[] bytes2 = Encoding.ASCII.GetBytes("\n");
        //        fs.Write(bytes2, 0, bytes2.Length);
        //    }
        //}

        using (TextWriter tw = new StreamWriter(new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true)))
        {
            for (int row = 1; row < 3 * 1000; row++)
            {
                for (int column = 1; column < 3 * 1000; column++)
                {
                    tw.Write(1.ToString());
                    tw.Write(',');
                }

                tw.WriteLine();
            }
        }

        DateTime end = DateTime.Now;

        MessageBox.Show(string.Format("Time spent: {0:N0} ms.", (end - start).TotalMilliseconds));

The first version (which is almost identical to your code, now commented out) takes 3,670 seconds (yes, over three thousand). The second version (not commented out) takes 12 seconds.

My experience with reading Excel from C# is, generally, nasty. All your computing time is spent talking to Excel; writing out CSV files takes no time at all. It isn't worth bothering with separate threads.

In some cases I simply saved the spreadsheet as .csv and then parsed it from there. How this works with multiple sheets I don't know, but you might be able to page through the sheets, saving them to .csv files one by one. Then read the .csv files as long strings and clean them up.
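If the sheets do end up as one .csv file each, combining them could look roughly like the sketch below. CsvMerge and MergeCsvFiles are hypothetical names, the only "clean-up" shown is skipping empty lines, and it assumes the output file lives outside the folder being read.

```csharp
using System;
using System.IO;

static class CsvMerge
{
    // Appends every *.csv file in 'folder' to one output file, in
    // ordinal file-name order, and returns the number of lines written.
    // Assumption: outputPath is not inside 'folder'.
    public static int MergeCsvFiles(string folder, string outputPath)
    {
        string[] files = Directory.GetFiles(folder, "*.csv");
        Array.Sort(files, StringComparer.Ordinal);

        int lines = 0;
        using (var writer = new StreamWriter(outputPath, false))
        {
            foreach (string file in files)
            {
                // ReadLines streams the file instead of loading it whole.
                foreach (string line in File.ReadLines(file))
                {
                    if (line.Length == 0) continue; // drop empty lines
                    writer.WriteLine(line);
                    lines++;
                }
            }
        }
        return lines;
    }
}
```

A single sequential writer like this also avoids having several threads append to the same file at once.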
