How can I improve the performance of retrieving values from SharedStringTable in OpenXml Excel spreadsheet tools?

Question

I'm using DocumentFormat.OpenXml to read an Excel spreadsheet. I have a performance bottleneck with the code used to look up the cell value from the SharedStringTable object (it seems to be some sort of lookup table for cell values):

var returnValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;

I've created a dictionary to ensure I only retrieve a value once:

if (dictionary.ContainsKey(parsedValue))
{
    return dictionary[parsedValue];
}

var fetchedValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;
dictionary.Add(parsedValue, fetchedValue);
return fetchedValue;

This has cut down the performance time by almost 50%. However my metrics indicate that it still takes 208 seconds for the line of code fetching the value from the SharedStringTable object to execute 123,951 times. Is there any other way of optimising this operation?

Answer 1

I would read the whole shared string table into your dictionary in one go rather than looking up each value as required. This will allow you to move through the file in order and stash the values ready for a hashed lookup which will be more efficient than scanning the SST for each value you require.

Running something like the following at the start of your process will allow you to access each value using dictionary[parsedValue] .

private static void LoadDictionary()
{
    int i = 0;

    foreach (var ss in sharedStringTablePart.SharedStringTable.ChildElements)
    {
        dictionary.Add(i++, ss.InnerText);
    }
}

If your file is very large, you might see some gains using a SAX approach to read the file rather than the DOM approach above:

private static void LoadDictionarySax()
{
    using (OpenXmlReader reader = OpenXmlReader.Create(sharedStringTablePart))
    {
        int i = 0;
        while (reader.Read())
        {
            if (reader.ElementType == typeof(SharedStringItem))
            {
                SharedStringItem ssi = (SharedStringItem)reader.LoadCurrentElement();
                dictionary.Add(i++, ssi.Text != null ? ssi.Text.Text : string.Empty);
            }
        }
    }
}

On my machine, using a file with 60000 rows and 2 columns it was around 300 times quicker using the LoadDictionary method above instead of the GetValue method from your question. The LoadDictionarySax method gave similar performance but on a larger file (100000 rows with 10 columns) the SAX approach was around 25% faster than the LoadDictionary method. On an even larger file (100000 rows, 26 columns), the LoadDictionary method threw an out of memory exception but the LoadDictionarySax worked without issue.

How can I improve the performance of retrieving values from SharedStringTable in OpenXml Excel spreadsheet tools?

Question

1 answers

solution1
4 ACCPTED 2017-02-28 17:12:46

How can I improve the performance of retrieving values from SharedStringTable in OpenXml Excel spreadsheet tools?

Question

1 answers

solution1 4 ACCPTED 2017-02-28 17:12:46

solution1
4 ACCPTED 2017-02-28 17:12:46