Apache POI - is caching workbook best way to reuse?

We have been using Apache POI in production for a few years with good results. We are currently on version 3.11. We only use HSSF (it is faster than XSSF according to our tests, and we can live without XLSX).

We currently keep a cached map of "synchronized workbook runners" in memory, about 70 or so. Think of each XLS as a product; the map key tells us which one to use. We load the cache on startup so we never read files live.

Our synchronized runners are roughly this:

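public class PoiProcessorSynchronized {
  private Workbook workbook; // cached at startup, one per product

  // synchronized so only one request at a time touches this workbook
  public synchronized Map<String, Object> process(Request request) {
    // request has input/output params
    return engine.process(workbook, request);
  }
}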

This has made performance pretty good (27k requests at 112 ms on average over the last 24 hours); some sheets are slow, some fast. We manually reset the inputs in the sheet between requests to ensure the sheet is clean between uses.

We keep processing of each sheet synchronized to prevent miscalculations. We did initially see some miscalculations when access to the sheets was not controlled. Since we added synchronization it has been solid.

Some issues I'm concerned about:

  1. Each XLS can only process one request at a time, per server. I suppose we could address that by moving to some sort of pool of processors.
  2. Workbooks are relatively large in memory. If we continue to add XLS files to the cache, we have to keep adding memory.
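
The "pool of processors" idea in item 1 could look roughly like the sketch below. It is not from the original post: `WorkbookPool`, `withWorkbook` and `idleCount` are hypothetical names, and the type parameter `W` stands in for a POI `Workbook` so the example compiles with the JDK alone. It assumes each pooled copy is loaded separately (for example by reading the same .xls file several times), so no two threads ever share one workbook.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical pool of independent workbook copies for one product. */
final class WorkbookPool<W> {
  private final BlockingQueue<W> idle;

  /** Loads `size` separate copies up front; the loader is an assumption. */
  WorkbookPool(int size, Supplier<W> loader) {
    idle = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) {
      idle.add(loader.get());
    }
  }

  /** Borrows a copy, runs the work on it, and always returns it to the pool. */
  <R> R withWorkbook(Function<W, R> work) {
    W workbook;
    try {
      workbook = idle.take(); // blocks while every copy is in use
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a workbook", e);
    }
    try {
      return work.apply(workbook);
    } finally {
      idle.add(workbook); // cannot overflow: at most `size` copies exist
    }
  }

  /** Number of copies currently not in use. */
  int idleCount() {
    return idle.size();
  }
}
```

With a pool of size N, up to N requests for the same product can run concurrently, at the cost of N times the memory for that workbook, so this trades issue 2 against issue 1. The manual input reset between uses would still be needed inside the work function.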

Is anyone else trying to do something similar? The approach is working for now, but it feels like there should be a better way.

Is it possible we could be caching something other than the Workbook? Or serializing something?

Has anyone successfully processed high volumes through workbooks WITHOUT synchronizing them? If so, how?

At the library level, Apache POI is thread-safe. At the workbook level (and the sheet/row/cell/etc. level), Apache POI is not thread-safe. A given Workbook must only be worked on by a single thread at a time. If you have multiple threads working in parallel, they must each have their own Workbook to process. Two threads working on the same workbook (including working on different sheets in the same workbook) is not supported.
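
One way to give each thread its own Workbook is a `ThreadLocal`. The sketch below is only an illustration of that rule, not something the answer prescribes: `PerThreadWorkbook` is a hypothetical name, `W` stands in for a POI `Workbook`, and the loader (for instance, code that opens the .xls file) is assumed to return a fresh, independent instance on every call.

```java
import java.util.function.Supplier;

/** Hypothetical holder that gives every thread its own private workbook copy. */
final class PerThreadWorkbook<W> {
  private final ThreadLocal<W> local;

  PerThreadWorkbook(Supplier<W> loader) {
    // The loader runs once per thread, on that thread's first access.
    this.local = ThreadLocal.withInitial(loader);
  }

  /** Returns the calling thread's copy; other threads get different instances. */
  W get() {
    return local.get();
  }
}
```

Because each worker thread ends up holding its own copy, memory use grows with the number of threads times the number of workbooks, which matters when around 70 workbooks are already cached.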

On the whole, loading a .xls file is fairly quick. Use a File rather than an InputStream if you can, for slightly lower memory use and slightly quicker loading. See the memory and performance FAQ for some guides. Make sure you're using the latest version of Apache POI for bug fixes and improvements.

For your specific case, some sort of cache for the most popular workbooks might work well. Perhaps only for the larger popular workbooks, with small workbooks just always loaded on demand.
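
A bounded cache of that kind can be sketched with a least-recently-used policy. This is an assumption about what "most popular" could mean in practice, not the answer's stated design: `WorkbookLruCache` and `isCached` are hypothetical names, `W` stands in for a POI `Workbook`, and the loader is whatever code reads the file for a given key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache: keeps recently used workbooks, reloads the rest on demand. */
final class WorkbookLruCache<K, W> {
  private final Function<K, W> loader;
  private final Map<K, W> cache;

  WorkbookLruCache(int maxEntries, Function<K, W> loader) {
    this.loader = loader;
    // accessOrder = true keeps entries ordered from least to most recently used.
    this.cache = new LinkedHashMap<K, W>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, W> eldest) {
        return size() > maxEntries; // evict the least recently used entry
      }
    };
  }

  /** Returns the cached workbook for the key, loading it if absent. */
  synchronized W get(K key) {
    W workbook = cache.get(key);
    if (workbook == null) {
      workbook = loader.apply(key);
      cache.put(key, workbook);
    }
    return workbook;
  }

  /** True if the key is currently held in memory; does not change LRU order. */
  synchronized boolean isCached(K key) {
    return cache.containsKey(key);
  }
}
```

The cache only decides which workbooks stay in memory. Each workbook it returns must still be used by one thread at a time, so it would sit alongside the synchronized runner rather than replace it.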

Otherwise, try some profiling, and see if there's somewhere that POI is doing too much work for certain of your files. Then report that and work to get it fixed; performance improvements are always welcomed by the project!

The answer to this question depends entirely on whether or not POI itself has been implemented in a completely thread-safe manner.

Given that concurrency and thread safety are not addressed anywhere in the documentation or FAQ on the POI site, you must assume it is not thread safe.

A quick peek at the POI 3.5 HSSFWorkbook code at DocJar reveals that there are no synchronization keywords and simple unsynchronized collections are used... so no, it's not thread safe.

Thus, your synchronized approach is likely the best you can do.
