Efficiently convert XML to Excel, selecting some fields

My Problem

I need to read 1000000 XML files, extract some information from each file, and then create a spreadsheet with that information. I currently have code that works; however, it takes about one hour... and I need this to be generated much more quickly.

XML file example

http://pastebin.com/M5uvVaYt (it's too big to paste here)

My Current Code

        string[] arquivosArr = Directory.GetFiles(@"D:\ALL_FILES", "*.xml", SearchOption.AllDirectories);
        List<string> arquivos = new List<string>(arquivosArr);
        XNamespace ns = "http://www.portalfiscal.inf.br/nfe";


        //EXCEL OBJ
        var excel = new Application();
        excel.DisplayAlerts = false;

        var workbooks = excel.Workbooks;
        var workbook = workbooks.Add(Type.Missing);
        var worksheets = workbook.Sheets;
        var worksheet = (Worksheet)worksheets[1];

        worksheet.Columns[58].NumberFormat = "@";

        var watch = System.Diagnostics.Stopwatch.StartNew();

        int i = 0;
        Parallel.ForEach(arquivos, arquivo =>            
        {
            try
            {
                var doc = XDocument.Load(arquivo);

                if (doc.Root.Name.LocalName == "nfeProc")
                {
                    var chave = doc.Descendants(ns + "chNFe").First().Value;
                    var itens = doc.Descendants(ns + "det");
                    //var info3 = .......
                    //var info4 = .......
                    //var info5 = .......
                    //var info6 = .......
                    //var info7 = .......
                    //var info8 = .......
                    //etc......

                    int starts = i;
                    Interlocked.Add(ref i, itens.Count());
                    foreach (var item in itens)
                    {
                        var data = new object[1, 58];
                        //data[0, 0] = .....
                        //data[0, 1] = .....
                        //data[0, 2] = .....
                        //data[0, 3] = .....
                        //data[0, 4] = .....
                        //data[0, 5] = .....
                        //data[0, 6] = .....
                        data[0, 27] = item.Attribute("nItem").Value;
                        data[0, 57] = chave;
                        var startCell = (Range)worksheet.Cells[(starts + 1), 1];
                        var endCell = (Range)worksheet.Cells[(starts + 1), 58];
                        var writeRange = worksheet.Range[startCell, endCell];
                        writeRange.Value2 = data;
                        starts++;
                    }


                    double perc = ((i + 1.00) / arquivos.Count) * 100;
                    Console.WriteLine("Add: " + (i + 1) + " (" + Math.Round(perc, 2) + "%)");


                }

            }
            catch (XmlException ex)
            {
                Console.WriteLine(ex.Message);
            }
        });

        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        Console.WriteLine(elapsedMs / 1000.0);

        workbook.SaveAs(@"D:\MY_INFO.xls");
        workbook.Close();
        excel.Quit();

I'm new to C#, so I apologize for my code.

A million files in an hour? How are you expecting to do better than that? You're currently processing 277 files a second!

You will need to run multiple processes and/or machines, each writing to a separate file, and then combine the results at the end to achieve any big improvement.

This question might be better suited for Code Review, since your code currently works. That being said, I can make the following suggestions:

  1. Do not call Console.WriteLine() inside Parallel.ForEach() for each of your 1000000 files! It is slow and blocking.

    Instead, consider outputting a heartbeat message once every ten seconds or so, and doing so from a separate thread that will not interfere with your XML processing threads. For instance, see NonBlockingConsole from here.
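
    By way of illustration, here is a minimal sketch of such a heartbeat (the Heartbeat class and its members are my own hypothetical names, not the NonBlockingConsole from the linked answer): the workers only increment a shared counter, and a System.Threading.Timer reports progress every ten seconds:

     // Assumes: using System; using System.Threading;
     static class Heartbeat
     {
         static int processed;     // incremented by the worker threads
         public static int Total;  // set once, before processing starts

         public static void Increment() => Interlocked.Increment(ref processed);

         // Starts a timer that logs progress every ten seconds until disposed.
         public static IDisposable Start()
         {
             return new Timer(_ =>
             {
                 int done = Volatile.Read(ref processed);
                 Console.WriteLine("Processed " + done + " of " + Total +
                     " (" + Math.Round(100.0 * done / Total, 2) + "%)");
             }, null, TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(10));
         }
     }

    Each worker then calls Heartbeat.Increment() once per file instead of writing to the console, and the main thread wraps the Parallel.ForEach in using (Heartbeat.Start()) { ... }.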

  2. Rather than loading each XML file into an XDocument, you could stream through each file with an XmlReader, loading only the minimum required into memory at any given time, along the lines of How to: Stream XML Fragments from an XmlReader. This should improve performance directly, by skipping construction of unneeded XElement subtrees, and indirectly, by reducing GC pressure.

    The following method streams through one of your XML files and returns your selected values in an object[,] table:

     const int ColumnLength = 58;
     const int ChaveIndex = 57;
     const int ItemIndex = 27;

     static bool TryExtractTable(string arquivo, out object[,] table)
     {
         XNamespace ns = "http://www.portalfiscal.inf.br/nfe";
         var rootName = ns + "nfeProc";
         var chaveName = ns + "chNFe";
         var itemsName = ns + "det";
         try
         {
             using (var reader = XmlReader.Create(arquivo))
             {
                 // Move to the root element, verify it's correct.
                 if (!reader.ReadToElement() || reader.XName() != rootName)
                 {
                     table = null;
                     return false;
                 }
                 string chaveValue = null;
                 List<object> itemValues = new List<object>();
                 bool alreadyReadNext = false;
                 while (alreadyReadNext || reader.Read())
                 {
                     alreadyReadNext = false;
                     if (reader.NodeType != XmlNodeType.Element)
                         continue;
                     var name = reader.XName();
                     if (chaveValue == null && name == chaveName)
                     {
                         chaveValue = ((XElement)XNode.ReadFrom(reader)).Value;
                         // XNode.ReadFrom advances the reader to the next node after the end
                         // of the current element. Thus a subsequent call to reader.Read()
                         // would skip this node, and so should not be made.
                         alreadyReadNext = true;
                     }
                     else if (name == itemsName)
                     {
                         // Access the "nItem" attribute directly.
                         var itemValue = reader["nItem"];
                         itemValues.Add(itemValue);
                     }
                 }
                 if (itemValues.Count > 0)
                 {
                     var nRows = itemValues.Count;
                     table = new object[nRows, ColumnLength];
                     for (int iRow = 0; iRow < nRows; iRow++)
                     {
                         table[iRow, ChaveIndex] = chaveValue;
                         table[iRow, ItemIndex] = itemValues[iRow];
                     }
                     return true;
                 }
             }
         }
         catch (Exception ex)
         {
             Console.WriteLine(ex.Message);
         }
         table = null;
         return false;
     }

    Using the extension methods:

     public static class XmlReaderExtensions
     {
         public static XName XName(this XmlReader reader)
         {
             return System.Xml.Linq.XName.Get(reader.LocalName, reader.NamespaceURI);
         }

         public static bool ReadToElement(this XmlReader reader)
         {
             while (reader.NodeType != XmlNodeType.Element)
                 if (!reader.Read())
                     return false;
             return true;
         }
     }
  3. Excel COM Interop is also slow, because each method call is actually an RPC call into another process. Thus, rather than writing each row to Excel individually, you should create a 2d array with all rows from a given XML file (as I did above) and write that 2d array to Excel in a single block. To do this, see for instance Write Array to Excel Range, Excel Interop - Efficiency and performance, or Microsoft.Office.Interop.Excel really slow.

    You might also consider chunking the individual tables into larger pieces before writing to Excel, to further reduce the number of Interop calls. If you have 1000000 files, that's otherwise going to be at least 1000000 RPC calls.
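
    As a rough sketch of that chunking (ChunkRows is an arbitrary size of my own choosing; ColumnLength is the constant from suggestion #2, and worksheet is the Interop Worksheet from your code, both assumed in scope), buffer rows from many files and flush them in one Value2 assignment per chunk:

     const int ChunkRows = 10000;   // assumed chunk size; tune as needed
     object[,] buffer = new object[ChunkRows, ColumnLength];
     int buffered = 0, nextRow = 1;

     // Copies one file's table into the buffer, flushing whenever it fills.
     void AddTable(object[,] table)
     {
         for (int r = 0; r < table.GetLength(0); r++)
         {
             if (buffered == ChunkRows)
                 Flush();
             for (int c = 0; c < ColumnLength; c++)
                 buffer[buffered, c] = table[r, c];
             buffered++;
         }
     }

     // Writes the buffered rows to Excel in a single Value2 assignment.
     void Flush()
     {
         if (buffered == 0)
             return;
         object[,] block = buffer;
         if (buffered < ChunkRows)
         {
             // Right-size the final, partial chunk before writing it.
             block = new object[buffered, ColumnLength];
             Array.Copy(buffer, block, buffered * ColumnLength);
         }
         var start = (Range)worksheet.Cells[nextRow, 1];
         var end = (Range)worksheet.Cells[nextRow + buffered - 1, ColumnLength];
         worksheet.Range[start, end].Value2 = block;   // one RPC call per chunk
         nextRow += buffered;
         buffered = 0;
     }

    Call AddTable(table) for each extracted table, and call Flush() one final time after the last file.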

  4. Also, Excel COM Interop is apparently not truly multithreaded, according to this answer and also this one. Instead, it is apartment-threaded: calls from other threads are marshalled to the thread that created the COM object.

    Thus, consider changing your threading strategy to use a multiple-producer / single-consumer queue, along the lines of this question or this one.

    In your producer threads, extract the necessary table of data from each XML file as an object[,]. In your single consumer thread, open the Excel file, consume each object[,] table and write it to the Excel file as a single 2d range, then finally close the file. This should avoid any marshalling penalty.
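
    Here is a minimal sketch of that arrangement, assuming the TryExtractTable method from suggestion #2; a bounded BlockingCollection is my choice of queue, not necessarily the one used in the linked questions. Note that the Excel objects are created inside the consumer task, so every COM call originates from the thread that owns them:

     // Assumes: using System.Collections.Concurrent; using System.Threading.Tasks;
     var queue = new BlockingCollection<object[,]>(boundedCapacity: 100);

     // Single consumer: the only thread that ever touches a COM object.
     var consumer = Task.Run(() =>
     {
         var excel = new Application();
         excel.DisplayAlerts = false;
         var workbook = excel.Workbooks.Add(Type.Missing);
         var worksheet = (Worksheet)workbook.Sheets[1];

         int nextRow = 1;
         foreach (var table in queue.GetConsumingEnumerable())
         {
             int nRows = table.GetLength(0);
             var start = (Range)worksheet.Cells[nextRow, 1];
             var end = (Range)worksheet.Cells[nextRow + nRows - 1, ColumnLength];
             worksheet.Range[start, end].Value2 = table;
             nextRow += nRows;
         }

         workbook.SaveAs(@"D:\MY_INFO.xls");
         workbook.Close();
         excel.Quit();
     });

     // Producers: parse the XML files in parallel, queueing one table per file.
     Parallel.ForEach(arquivos, arquivo =>
     {
         object[,] table;
         if (TryExtractTable(arquivo, out table))
             queue.Add(table);
     });

     queue.CompleteAdding();   // tell the consumer no more tables are coming
     consumer.Wait();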

  5. Now that you are writing to Excel from a single thread, consider abandoning Excel COM Interop entirely and writing directly to a file using one of the options from Import and Export Excel - What is the best library? or Create Excel (.XLS and .XLSX) file from C#. It could even be a simple CSV file!
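
    For instance, here is a minimal CSV sketch, again assuming the TryExtractTable method from suggestion #2 (the quoting below follows the usual CSV rules; the output path is just an example):

     // Assumes: using System.IO;
     using (var writer = new StreamWriter(@"D:\MY_INFO.csv"))
     {
         foreach (var arquivo in arquivos)
         {
             object[,] table;
             if (!TryExtractTable(arquivo, out table))
                 continue;
             for (int r = 0; r < table.GetLength(0); r++)
             {
                 var fields = new string[ColumnLength];
                 for (int c = 0; c < ColumnLength; c++)
                 {
                     var s = (table[r, c] ?? "").ToString();
                     // Quote any field containing a comma, quote, or newline.
                     fields[c] = s.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0
                         ? "\"" + s.Replace("\"", "\"\"") + "\""
                         : s;
                 }
                 writer.WriteLine(string.Join(",", fields));
             }
         }
     }

    Excel opens .csv files directly, and a plain sequential StreamWriter loop like this avoids Interop entirely.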

Of these, #1 - #3 look straightforward to test given your current code structure. #4 and #5 would be more challenging.
