
Replacing invalid XML characters in an Excel file and writing it back to disk causes a "file is corrupted" error on opening in MS Excel

A little background on the problem:

We have an ASP.NET MVC5 application where we use FlexMonster to show data in a grid. The data source is a stored procedure that brings all the data into the UI grid, and once the user clicks the export button, the report is exported to Excel. However, in some cases the export to Excel fails. Some of the data contains invalid characters, and it is not possible/feasible to fix the source as suggested here.

My approach so far:

The EPPlus library fails to initialize the workbook because the input Excel file contains some invalid XML characters. I found that the file is dumped with an invalid character in it. I looked into the possible approaches.

First, I identified the problematic character in the Excel file. I tried replacing the invalid character with a blank space manually using Notepad++, and EPPlus could then read the file successfully.

Then, using the approaches given in other SO threads here and here, I replaced all possible occurrences of invalid characters. At the moment I am using the XmlConvert.IsXmlChar method to find the problematic XML characters and replace them with blank spaces.

I created a sample program in which I try to process the problematic Excel sheet.

// in Main method
string readFile = File.ReadAllText(filePath);
string content = RemoveInvalidXmlChars(readFile);
File.WriteAllText(filePath, content);

// removal of invalid characters
static string RemoveInvalidXmlChars(string inputText)
{
    StringBuilder withoutInvalidXmlCharsBuilder = new StringBuilder();
    int firstOccurenceOfRealData = inputText.IndexOf("<t>");
    int lastOccurenceOfRealData = inputText.LastIndexOf("</t>");

    if (firstOccurenceOfRealData < 0 ||
        lastOccurenceOfRealData < 0 ||
        firstOccurenceOfRealData > lastOccurenceOfRealData)
        return inputText;

    withoutInvalidXmlCharsBuilder.Append(inputText.Substring(0, firstOccurenceOfRealData));
    int remaining = lastOccurenceOfRealData - firstOccurenceOfRealData;
    string textToCheckFor = inputText.Substring(firstOccurenceOfRealData, remaining);

    foreach (char c in textToCheckFor)
    {
        withoutInvalidXmlCharsBuilder.Append((XmlConvert.IsXmlChar(c)) ? c : ' ');
    }
    withoutInvalidXmlCharsBuilder.Append(inputText.Substring(lastOccurenceOfRealData));

    return withoutInvalidXmlCharsBuilder.ToString();
}

If I replace the problematic character manually using Notepad++, the file opens fine in MS Excel. The code above successfully replaces the same invalid character and writes the content back to the file. However, when I try to open the resulting file in MS Excel, it throws an error saying that the file may have been corrupted, and no content is displayed (screenshots below). Moreover, the following code

var excelPackage = new ExcelPackage(new FileInfo(filePath));

on the file that I updated via Notepad++ throws the following exception:

"CRC error: the file being extracted appears to be corrupted. Expected 0x7478AABE, Actual 0xE9191E00"}

My questions:

  1. Is my approach of modifying the content this way correct?
  2. If yes, how can I write the updated string to an Excel file?
  3. If my approach is wrong, how can I get rid of the invalid XML characters?

Errors shown on opening the file (without the invalid XML character):

First pop-up:

[image]

When I click Yes:

[image]

Thanks in advance!

It does sound like a binary (presumably XLSX) file based on your last comment. To confirm, open the file created by FlexMonster with 7zip. If it opens properly and you see a bunch of XML files in folders, it's an XLSX.
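The same check can also be done in code. The following is only a minimal sketch, assuming the export has been saved to disk; `XlsxProbe` and `LooksLikeXlsx` are hypothetical names made up for illustration, not part of EPPlus or FlexMonster. It looks for the ZIP local file header signature and then for the `[Content_Types].xml` part that an Open XML package carries at its root:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class XlsxProbe
{
    // Hypothetical helper: true if the file looks like a ZIP-based Open XML package.
    public static bool LooksLikeXlsx(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            // Non-empty ZIP archives normally start with the bytes 'P' 'K' 0x03 0x04.
            var header = new byte[4];
            int read = stream.Read(header, 0, header.Length);
            if (read < 4 || header[0] != 0x50 || header[1] != 0x4B ||
                header[2] != 0x03 || header[3] != 0x04)
                return false;

            stream.Position = 0;
            try
            {
                using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
                {
                    // An Open XML package has a [Content_Types].xml part at its root.
                    return zip.Entries.Any(e => e.FullName == "[Content_Types].xml");
                }
            }
            catch (InvalidDataException)
            {
                // The signature matched, but the archive itself could not be read.
                return false;
            }
        }
    }
}
```

If this returns true, the file is a compressed package, which would explain why reading it with `File.ReadAllText` and writing it back changes bytes that the archive's CRC checks depend on.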

In that case, a search/replace on a binary file sounds like a very bad idea. It might work on the XML parts, but it might also replace legitimate characters in other parts. I think the better approach would be to do as @PanagiotisKanavos suggests and use ZipArchive. But you have to rebuild the archive in the right order, otherwise Excel complains. Similar to how it was done here https://stackoverflow.com/a/33312038/1324284, you could do something like this:

using System.IO;
using System.IO.Compression;
using System.Xml;

public static class ZipArchiveExtensions
{
    public static void ReplaceXmlString(this ZipArchive xlsxZip, FileInfo outFile, string oldString, string newstring)
    {
        using (var outStream = outFile.Open(FileMode.Create, FileAccess.ReadWrite))
        using (var copiedzip = new ZipArchive(outStream, ZipArchiveMode.Update))
        {
            // Go through each file in the zip one by one and copy it over to the new file - entries need to stay in order
            foreach (var entry in xlsxZip.Entries)
            {
                var newentry = copiedzip.CreateEntry(entry.FullName);

                // Dispose both streams even if loading or saving a part throws
                using (var newstream = newentry.Open())
                using (var orgstream = entry.Open())
                {
                    if (!entry.Name.EndsWith(".xml"))
                    {
                        // Copy non-xml files over unchanged
                        orgstream.CopyTo(newstream);
                    }
                    else
                    {
                        // Load the xml document to manipulate
                        var xdoc = new XmlDocument();
                        xdoc.Load(orgstream);

                        // Replace on the serialized text, then re-parse to make sure the result is still well-formed
                        var xml = xdoc.OuterXml.Replace(oldString, newstring);
                        xdoc = new XmlDocument();
                        xdoc.LoadXml(xml);

                        xdoc.Save(newstream);
                    }
                }
            }
        }
    }
}

When it is used like this:

[TestMethod]
public void ReplaceXmlTest()
{
    var datatable = new DataTable("tblData");
    datatable.Columns.AddRange(new[]
    {
        new DataColumn("Col1", typeof (int)),
        new DataColumn("Col2", typeof (int)),
        new DataColumn("Col3", typeof (string))
    });

    for (var i = 0; i < 10; i++)
    {
        var row = datatable.NewRow();
        row[0] = i;
        row[1] = i * 10;
        row[2] = i % 2 == 0 ? "ABCD" : "AXCD";
        datatable.Rows.Add(row);
    }

    using (var pck = new ExcelPackage())
    {
        var workbook = pck.Workbook;
        var worksheet = workbook.Worksheets.Add("source");

        worksheet.Cells.LoadFromDataTable(datatable, true);
        worksheet.Tables.Add(worksheet.Cells["A1:C11"], "Table1");

        // Now simulate the copy/open of the Excel file into a zip archive
        using (var orginalzip = new ZipArchive(new MemoryStream(pck.GetAsByteArray()), ZipArchiveMode.Read))
        {
            var fi = new FileInfo(@"c:\temp\ReplaceXmlTest.xlsx");
            if (fi.Exists)
                fi.Delete();

            orginalzip.ReplaceXmlString(fi, "AXCD", "REPLACED!!");
        }
    }
}

Gives this:

[image]

Just keep in mind that this is completely brute force. Anything you can do to make the file filter smarter, rather than simply processing ALL xml files, would be a very good thing. Maybe limit it to the xl/sharedStrings.xml file if that is where the problem lies, or to the xml files in the worksheets folder. Hard to say without knowing more about the data.
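One caveat for the invalid-character case specifically: `XmlDocument.Load` in `ReplaceXmlString` will most likely throw on the same characters that trip up EPPlus, so those parts probably need to be cleaned as text before any XML parser sees them. The following is a rough sketch of that idea combined with the narrower filter suggested above. It assumes the parts are UTF-8 encoded and that the bad characters appear literally rather than as numeric character references such as `&#x1;` (which it does not touch). `XlsxSanitizer`, `IsCellTextPart`, `ReplaceInvalidXmlChars`, and `Sanitize` are hypothetical names made up for illustration:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public static class XlsxSanitizer
{
    // Hypothetical filter: only the parts that normally hold cell text.
    public static bool IsCellTextPart(string fullName)
    {
        return fullName.Equals("xl/sharedStrings.xml", StringComparison.OrdinalIgnoreCase)
            || (fullName.StartsWith("xl/worksheets/", StringComparison.OrdinalIgnoreCase)
                && fullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase));
    }

    // Replaces characters that are not legal in XML 1.0 with a space.
    // Valid surrogate pairs are kept; lone surrogates are replaced.
    public static string ReplaceInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                sb.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else
            {
                sb.Append(XmlConvert.IsXmlChar(c) ? c : ' ');
            }
        }
        return sb.ToString();
    }

    // Copies the package entry by entry, in the original order, cleaning only the filtered parts.
    public static void Sanitize(string sourcePath, string targetPath)
    {
        using (var source = new ZipArchive(File.OpenRead(sourcePath), ZipArchiveMode.Read))
        using (var target = new ZipArchive(File.Create(targetPath), ZipArchiveMode.Create))
        {
            foreach (var entry in source.Entries)
            {
                var newEntry = target.CreateEntry(entry.FullName);
                using (var input = entry.Open())
                using (var output = newEntry.Open())
                {
                    if (!IsCellTextPart(entry.FullName))
                    {
                        input.CopyTo(output); // untouched parts are copied byte for byte
                        continue;
                    }

                    string text;
                    using (var reader = new StreamReader(input, Encoding.UTF8))
                        text = reader.ReadToEnd();

                    // Write back as UTF-8 without a byte order mark (assumption: parts are UTF-8).
                    byte[] bytes = new UTF8Encoding(false).GetBytes(ReplaceInvalidXmlChars(text));
                    output.Write(bytes, 0, bytes.Length);
                }
            }
        }
    }
}
```

A call such as `XlsxSanitizer.Sanitize(inputPath, outputPath)` writes a new file instead of overwriting the export, so the original stays available for comparison if Excel still complains.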
