
How to extract text from xlsx file using OpenXml

I need to extract the text from an xlsx file (to put into a full text index on a database). I am using the following code:

StringBuilder b = new StringBuilder(); // collects the extracted text
using(SpreadsheetDocument d = SpreadsheetDocument.Open(stream, false)) {
 // Load the shared strings table.
 SharedStringTablePart stringTable = 
  d.WorkbookPart.GetPartsOfType<SharedStringTablePart>()
  .FirstOrDefault();
 if(stringTable == null) System.Diagnostics.Debug.WriteLine("Null string table");
 foreach(WorksheetPart part in d.WorkbookPart.WorksheetParts) {
  foreach(SheetData sheet in part.Worksheet.Elements<SheetData>()) {
   foreach(Row r in sheet.Elements<Row>()) {
    bool added = false; // reset per row so a line never starts with a tab
    foreach(Cell c in r.Elements<Cell>()) {
     if(c.DataType != null) {
      string v = c.CellValue.Text;
      if(v != null && c.DataType.Value == CellValues.SharedString) {
       var tableEntry = stringTable.SharedStringTable.ElementAt(int.Parse(v));
       if(tableEntry != null) {
        v = tableEntry.InnerText;
       }
      }
      if(v != null) {
       if(added) b.Append('\t');
       b.Append(v);
       added = true;
      }
     }
    }
    if(added) b.AppendLine();
   }
  }
 }
}
return b.ToString();

The examples I found on the web didn't mention the shared strings table. I only found out about it when I realised that no string data was being output.
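For anyone else who hits this, here is a minimal sketch of the indirection, using only System.Xml.Linq on hand-written XML fragments rather than the OpenXml SDK. The class name `SharedStringDemo` and the method `Resolve` are made up for illustration. It assumes a cell with `t="s"` stores in `<v>` a zero-based index into the `<si>` entries of the shared strings part, and that any other cell stores its value directly in `<v>`.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class SharedStringDemo
{
    // Main SpreadsheetML namespace used by worksheet and shared-string parts.
    static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Hypothetical helper: returns the text of a <c> element, or null if it has no <v>.
    public static string Resolve(XElement cell, XElement sharedStrings)
    {
        string raw = (string)cell.Element(Ns + "v");
        if (raw == null) return null;

        // Anything other than t="s" keeps its value inline in <v>.
        if ((string)cell.Attribute("t") != "s") return raw;

        // t="s": <v> is an index into the <si> entries.
        XElement si = sharedStrings.Elements(Ns + "si").ElementAt(int.Parse(raw));

        // Join every <t> below the entry so rich-text runs are included.
        return string.Concat(si.Descendants(Ns + "t").Select(t => t.Value));
    }

    public static void Main()
    {
        string ns = Ns.NamespaceName;
        XElement sst = XElement.Parse(
            "<sst xmlns='" + ns + "'><si><t>Hello</t></si>" +
            "<si><r><t>Wor</t></r><r><t>ld</t></r></si></sst>");
        XElement shared = XElement.Parse(
            "<c xmlns='" + ns + "' r='A1' t='s'><v>1</v></c>");
        XElement number = XElement.Parse(
            "<c xmlns='" + ns + "' r='B1'><v>42</v></c>");

        Console.WriteLine(Resolve(shared, sst));  // text of entry 1
        Console.WriteLine(Resolve(number, sst));  // raw value from <v>
    }
}
```

The second shared-string entry is split into two runs on purpose: reading only a direct `<t>` child would miss it.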

Are there any other gotchas I should know about?

Other criticisms of the code are always welcome.

There are some tricky parts to extracting actual data from cells. Sometimes it's stored in the cell itself (numbers, dates, inline strings), and sometimes the cell references the SharedStringTable. I have browsed through quite a few functions and this is what I came up with (some copied, some mine). You should be able to comfortably slide this into your code after

foreach(Cell c in r.Elements<Cell>()) {

like this

string v = GetValueFromCell(c, d.WorkbookPart);

        /// <summary>
        /// Return si value based on xml cell id number
        /// </summary>
        /// <param name="workbookPart"></param>
        /// <param name="id"></param>
        /// <returns>SharedStringItem for interpretation</returns>
        public static SharedStringItem GetSharedStringItemById(WorkbookPart workbookPart, int id)
        {
            return workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(id);
        }

        /// <summary>
        /// Return value from the cell based on the cell's information (innards and/or id)
        /// </summary>
        /// <param name="cell">spreadsheet cell</param>
        /// <param name="workbookPart">work book from uploaded file</param>
        /// <returns>string value of the cell</returns>
        public static string GetValueFromCell(Cell cell, WorkbookPart workbookPart)
        {
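            int id;
            // Prefer the <v> element: for formula cells InnerText also contains the formula text.
            // Inline strings have no <v>, so fall back to InnerText for them.
            string cellValue = (cell.CellValue != null ? cell.CellValue.Text : cell.InnerText) ?? string.Empty;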

            if (cellValue.Trim().Length > 0)
            {
                if (cell.DataType != null)
                {
                    switch (cell.DataType.Value)
                    {
                        case CellValues.SharedString:

                            // Only look up the table when the index parses; otherwise keep the raw text.
                            if (Int32.TryParse(cellValue, out id))
                            {
                                SharedStringItem item = GetSharedStringItemById(workbookPart, id);
                                // Text is set only for plain strings; InnerText also covers rich-text runs.
                                cellValue = item.Text != null ? item.Text.Text : item.InnerText;
                            }
                            break;

                        case CellValues.Boolean:
                            switch (cellValue)
                            {
                                case "0":
                                    cellValue = "FALSE";
                                    break;
                                default:
                                    cellValue = "TRUE";
                                    break;
                            }
                            break;
                    }
                }

                else
                {
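                    int excelDate;
                    var stylesPart = workbookPart.WorkbookStylesPart;
                    // A cell without a style index, or a workbook without cell formats, cannot be checked for a date format.
                    if (cell.StyleIndex != null && stylesPart != null && stylesPart.Stylesheet.CellFormats != null
                        && Int32.TryParse(cellValue, out excelDate))
                    {
                        var styleIndex = (int)cell.StyleIndex.Value;
                        var cellFormats = stylesPart.Stylesheet.CellFormats;
                        // NumberingFormats holds custom formats only and is null when the workbook defines none.
                        var numberingFormats = stylesPart.Stylesheet.NumberingFormats;
                        var cellFormat = (CellFormat)cellFormats.ElementAt(styleIndex);

                        if (cellFormat.NumberFormatId != null && numberingFormats != null)
                        {
                            var numberFormatId = cellFormat.NumberFormatId.Value;
                            var numberingFormat = numberingFormats.Elements<NumberingFormat>().FirstOrDefault(f => f.NumberFormatId != null && f.NumberFormatId.Value == numberFormatId);

                            // Heuristic: only custom formats containing "/yy" count as dates. TODO: handle locales.
                            if (numberingFormat != null && numberingFormat.FormatCode != null && numberingFormat.FormatCode.Value.Contains("/yy"))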
                            {
                                DateTime dt = DateTime.FromOADate(excelDate);
                                cellValue = dt.ToString("MM/dd/yyyy");
                            }
                        }
                    }
                }
            }
            return cellValue;
        }
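
The date branch above only recognises custom formats and whole-number serials. A rough sketch of two additions that might help is below, written without the OpenXml SDK so it can be run on its own. The class `ExcelDateSketch` and both method names are made up for illustration. It assumes the built-in number format ids 14 to 22 and 45 to 47 are the date/time ones, and that the workbook uses the 1900 date system (a workbook using the 1904 system would need an offset that this sketch does not apply).

```csharp
using System;
using System.Globalization;

static class ExcelDateSketch
{
    // Built-in ids assumed to be date/time formats (14-22 and 45-47).
    // These never appear in the stylesheet's custom NumberingFormats list.
    public static bool IsBuiltInDateFormat(uint numberFormatId)
    {
        return (numberFormatId >= 14 && numberFormatId <= 22)
            || (numberFormatId >= 45 && numberFormatId <= 47);
    }

    // Converts the raw <v> text of a date cell to ISO-style text.
    // Returns the raw text unchanged when it is not a usable serial.
    public static string FormatSerial(string raw)
    {
        double serial;
        // Cell values are written with '.' as the decimal separator,
        // so parse with the invariant culture, not the machine locale.
        if (!double.TryParse(raw, NumberStyles.Float,
                             CultureInfo.InvariantCulture, out serial))
        {
            return raw;
        }
        try
        {
            // The fractional part of the serial carries the time of day.
            return DateTime.FromOADate(serial)
                .ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
        }
        catch (ArgumentException)
        {
            return raw; // outside the range FromOADate accepts
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsBuiltInDateFormat(14));  // True
        Console.WriteLine(IsBuiltInDateFormat(2));   // False
        Console.WriteLine(FormatSerial("43831.5"));  // 2020-01-01 12:00:00
    }
}
```

In GetValueFromCell, a check such as IsBuiltInDateFormat(cellFormat.NumberFormatId.Value) could sit next to the "/yy" test, and a double parse could replace the Int32 one so that cells holding a time of day are not skipped.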

