How can I improve the performance of retrieving values from the SharedStringTable in the OpenXml Excel spreadsheet tools?

Question

I'm using DocumentFormat.OpenXml to read an Excel spreadsheet. I have a performance bottleneck in the code that looks up a cell value from the SharedStringTable object (it seems to be some sort of lookup table for cell values):

var returnValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;

I've created a dictionary to ensure I only retrieve each value once:

if (dictionary.ContainsKey(parsedValue))
{
    return dictionary[parsedValue];
}

var fetchedValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;
dictionary.Add(parsedValue, fetchedValue);
return fetchedValue;

This has cut the execution time by almost 50%. However, my metrics indicate that the line of code fetching the value from the SharedStringTable object still takes 208 seconds to execute 123,951 times. Is there any other way of optimising this operation?

Answer 1

I would read the whole shared string table into your dictionary in one go rather than looking up each value as required. This lets you move through the file in order and stash the values ready for a hashed lookup, which is more efficient than scanning the SST for each value you require.

Running something like the following at the start of your process will allow you to access each value using dictionary[parsedValue].

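// Assumes sharedStringTablePart and dictionary (keyed by int) are fields, as in the question.
private static void LoadDictionary()
{
    int i = 0;

    // Walk the table once, in order; a shared string's index is its position in the table.
    foreach (var ss in sharedStringTablePart.SharedStringTable.ChildElements)
    {
        dictionary.Add(i++, ss.InnerText);
    }
}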
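Because shared string indices run from 0 upwards without gaps, a plain array can stand in for the dictionary. The following is only a sketch of that idea, not part of the original answer: the class name SharedStringLookup and the methods Load and GetCellText are hypothetical names chosen for illustration, and it assumes the cell stores the shared string index as its value whenever its data type is SharedString.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Spreadsheet;

// Hypothetical helper (names are illustrative, not from the original answer).
public static class SharedStringLookup
{
    // Copies every shared string into an array in a single pass.
    // The array index is the shared string index referenced by cells.
    public static string[] Load(SharedStringTable table)
    {
        return table.Elements<SharedStringItem>()
                    .Select(item => item.InnerText)
                    .ToArray();
    }

    // Returns the shared string a cell points to, or the cell's raw stored value
    // when the cell is not of the SharedString data type.
    public static string GetCellText(Cell cell, string[] sharedStrings)
    {
        string raw = cell.CellValue != null ? cell.CellValue.Text : string.Empty;
        bool isShared = cell.DataType != null
            && cell.DataType.Value == CellValues.SharedString;
        return isShared ? sharedStrings[int.Parse(raw)] : raw;
    }
}
```

With the fields from the question, this would be called once as SharedStringLookup.Load(sharedStringTablePart.SharedStringTable), and every later cell lookup becomes an array index. Elements<SharedStringItem>() is used instead of ChildElements so that any child that is not a shared string item cannot shift the indices.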

If your file is very large, you might see some gains using a SAX approach to read the file rather than the DOM approach above:

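private static void LoadDictionarySax()
{
    // Stream the part element by element instead of loading the whole table as a DOM.
    using (OpenXmlReader reader = OpenXmlReader.Create(sharedStringTablePart))
    {
        int i = 0;
        while (reader.Read())
        {
            if (reader.ElementType == typeof(SharedStringItem))
            {
                SharedStringItem ssi = (SharedStringItem)reader.LoadCurrentElement();
                // Text is null when the item has no direct text child (for example rich text
                // stored in runs); such items are stored as empty strings here.
                dictionary.Add(i++, ssi.Text != null ? ssi.Text.Text : string.Empty);
            }
        }
    }
}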

On my machine, using a file with 60,000 rows and 2 columns, the LoadDictionary method above was around 300 times quicker than the GetValue method from your question. The LoadDictionarySax method gave similar performance, but on a larger file (100,000 rows with 10 columns) the SAX approach was around 25% faster than the LoadDictionary method. On an even larger file (100,000 rows, 26 columns), the LoadDictionary method threw an out of memory exception, but the LoadDictionarySax method worked without issue.

Answer 1 (4, accepted, 2017-02-28 17:12:46)
