Apache POI - is caching workbook best way to reuse?

Question

We have been using Apache POI in production for a few years with good results. Currently on version 3.11. We only use HSSF (faster than XSSF according to our tests, and we can live without XLSX.)

We currently keep a cached map of "synchronized workbook runners", about 70 or so in memory. Think of each XLS as a product, and the map key tells us which one to use. We load the cache on startup so we never read files live.

Our synchronized runners are roughly this:

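import java.util.Map;
import org.apache.poi.ss.usermodel.Workbook;

public class PoiProcessorSynchronized {
  private Workbook workbook; // loaded once at startup and reused
  // engine and Request are our own types (not shown)
  public synchronized Map<String, Object> process(Request request) {
    // one request at a time per workbook; request has input/output params
    return engine.process(workbook, request);
  }
}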

This has made performance pretty good (27k requests at 112 ms on average over the last 24 hours). Some sheets are slow, some fast. We manually reset the inputs in the sheet between requests to ensure the sheet is clean between uses.

We keep processing of each sheet synchronized to prevent miscalculations. We did initially see some miscalculations without controlling access to the sheets. Since we added that, it has been solid.

Some issues I'm concerned about:

Each XLS can only process one request at a time, per server. We could address that problem by going to some sort of pool of processors, I suppose.
Workbooks are relatively large in memory. If we continue to add XLS to cache, we have to add more and more memory.

Is anyone else trying to do something similar? The approach is working for now, but it feels like there should be a better way.

Is it possible we could be caching something other than the Workbook? Or serializing something?

Has anyone successfully processed high volumes through workbooks WITHOUT synchronizing them? If so how?

Answer 1

At the library level, Apache POI is thread-safe. At the workbook level (and the sheet/row/cell etc. level), Apache POI is not thread-safe. A given Workbook must only be worked on by a single thread at a time. If you have multiple threads working in parallel, they must each have their own Workbook to process. Two threads working on the same workbook (including working on different sheets in the same workbook) is not supported.
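
One way to give each thread its own Workbook is the pool the question hints at: load a few independent copies of the same file and hand each one to a single thread at a time. The following is only a minimal sketch under that assumption. `InstancePool` and `withInstance` are hypothetical names, not POI API, and the generic type `T` stands in for `Workbook` so the sketch does not depend on POI itself.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper, not part of POI: a fixed-size pool of independently
// loaded instances (T would be Workbook), so that no instance is ever used
// by two threads at the same time.
final class InstancePool<T> {
    private final BlockingQueue<T> idle;

    InstancePool(int size, Supplier<T> loader) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(loader.get()); // each copy is loaded separately
        }
    }

    // Borrows an instance (blocking until one is free), runs the task on it,
    // and always returns the instance to the pool afterwards.
    <R> R withInstance(Function<T, R> task) {
        T instance;
        try {
            instance = idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for a free instance", e);
        }
        try {
            return task.apply(instance);
        } finally {
            idle.add(instance);
        }
    }
}
```

The trade-off is memory: a pool of N copies costs roughly N times the footprint of one workbook, so it probably only pays off for the products that see real contention. Resetting the inputs between uses, as the question describes, would still be needed for every pooled copy.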

On the whole, loading a .xls file is fairly quick. Use a File rather than an InputStream if you can, for slightly lower memory use and slightly quicker loading. See the memory and performance FAQ for some guidance. Make sure you're using the latest version of Apache POI for bug fixes and improvements.

For your specific case, some sort of cache for the most popular workbooks might work well. Perhaps only for the larger popular workbooks, with small workbooks just always loaded on demand.
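
A cache limited to the most popular workbooks could be sketched as a small least-recently-used map that loads on a miss. This is an illustration rather than anything POI provides: `LruCache` is a hypothetical name, `V` stands in for whatever is cached (a workbook or one of the question's runners), and the loader function is assumed to read the file from disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of POI: keeps at most maxEntries values
// and evicts the least recently used one when the limit is exceeded.
final class LruCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;

    LruCache(final int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, loading it on a miss. Synchronized because an
    // access-ordered LinkedHashMap is structurally modified even by get().
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key); // e.g. read the .xls file from disk
            entries.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

In this sketch the loader runs while the lock is held, so a slow load briefly blocks other lookups. That keeps the code simple, but it may be worth revisiting if loads turn out to be expensive.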

Otherwise, try some profiling, and see if there's somewhere that POI is doing too much work for certain of your files. Then report that and work to get it fixed; performance improvements are always welcomed by the project!

Answer 2

The answer to this question depends entirely on whether or not POI itself has been implemented in a completely thread-safe manner.

Given that concurrency and thread safety are not addressed anywhere in the documentation or FAQ on the POI site, you must assume it is not thread safe.

A quick peek at the POI 3.5 HSSFWorkbook code at DocJar reveals that there are no synchronization keywords and simple unsynchronized collections are used... so no, it's not thread safe.

Thus, your synchronized approach is likely the best you can do.

2 answers

solution1
2 ACCEPTED 2016-01-07 14:49:40

solution2
1 2016-01-01 00:17:54