
Apache POI Streaming (SXSSF) for Reading

I need to read large Excel files and import their data into my application.

POI needs a large amount of heap to work and often throws OutOfMemoryError, so I looked for an alternative and found that there is a Streaming API for handling Excel data serially (rather than loading the whole file into memory).

I created an .xlsx workbook with a single worksheet, typed values into several cells, and came up with the following code to try reading it:

public static void main(String[] args) throws Throwable {
    // keep 100 rows in memory, exceeding rows will be flushed to disk
    SXSSFWorkbook wb = new SXSSFWorkbook(new XSSFWorkbook(new FileInputStream("C:\\test\\tst.xlsx")));
    SXSSFSheet sheet = (SXSSFSheet) wb.getSheetAt(0);
    Row row = sheet.getRow(0);
    //row is always null
    while(row.iterator().hasNext()){ //-> NullPointerException
        System.out.println(row.getCell(0).getStringCellValue());
    }
}

However, although I can get the worksheets properly, the rows always come back empty (null).

I have searched the internet and found several examples of the Streaming API, but none of them read existing files; they are all about generating Excel files.

Is it actually possible to read data from existing .xlsx files as a stream?

After digging some more, I found this library:

If you've used Apache POI in the past to read in Excel files, you probably noticed that it's not very memory efficient. Reading in an entire workbook will cause a severe memory usage spike, which can wreak havoc on a server.

There are plenty of good reasons why Apache has to read in the whole workbook, but most of them have to do with the fact that the library allows you to read and write with random addresses. If (and only if) you just want to read the contents of an Excel file in a fast and memory-efficient way, you probably don't need this ability. Unfortunately, the only thing in the POI library for reading a streaming workbook requires your code to use a SAX-like parser. All of the friendly classes like Row and Cell are missing from that API.

This library serves as a wrapper around that streaming API while preserving the syntax of the standard POI API. Read on to see if it's right for you.

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .sheetIndex(0)        // index of sheet to use (defaults to 0)
        .sheetName("sheet1")  // name of sheet to use (overrides sheetIndex)
        .read(is);            // InputStream or File for XLSX file (required)

There is also the SAX Event API, which reads the document and parses its contents through events.

If memory footprint is an issue, then for XSSF, you can get at the underlying XML data and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of the low-level structure of .xlsx files, and who are happy processing XML in Java. It's relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read an XLSX file with a relatively small memory footprint.
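To make "get at the underlying XML data and process it yourself" more concrete, here is a minimal sketch that uses only the JDK (java.util.zip plus the built-in SAX parser), not POI's own event API. The names XlsxSaxSketch and CellSink are made up for illustration. It assumes the first worksheet is stored at xl/worksheets/sheet1.xml (common for files saved by Excel, but the authoritative mapping lives in xl/workbook.xml and its relationships), that element names are unprefixed, and that the shared-strings table fits in memory. Dates, styles and formulas are not interpreted; cells are reported with their raw stored value.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: streams cells out of an .xlsx with plain SAX.
final class XlsxSaxSketch {

    /** Hypothetical callback, invoked once per cell in document order. */
    interface CellSink {
        void cell(String ref, String value);
    }

    private static SAXParser newParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newSAXParser();
    }

    /** Collects the text of every <si> entry (concatenating all of its <t> runs). */
    static List<String> readSharedStrings(InputStream xml) throws Exception {
        List<String> strings = new ArrayList<>();
        newParser().parse(xml, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inText;

            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                if (qName.equals("si")) text.setLength(0);
                inText = qName.equals("t");
            }
            @Override public void characters(char[] ch, int start, int length) {
                if (inText) text.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if (qName.equals("t")) inText = false;
                if (qName.equals("si")) strings.add(text.toString());
            }
        });
        return strings;
    }

    /** Streams a worksheet part; only the current cell is held in memory. */
    static void readSheet(InputStream xml, List<String> shared, CellSink sink) throws Exception {
        newParser().parse(xml, new DefaultHandler() {
            private final StringBuilder value = new StringBuilder();
            private String ref;
            private String type;
            private boolean inValue;

            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                if (qName.equals("c")) {          // <c r="A1" t="s"> starts a cell
                    ref = a.getValue("r");
                    type = a.getValue("t");
                    value.setLength(0);
                }
                // <v> holds the stored value, <t> the text of an inline string
                inValue = qName.equals("v") || qName.equals("t");
            }
            @Override public void characters(char[] ch, int start, int length) {
                if (inValue) value.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if (qName.equals("v") || qName.equals("t")) inValue = false;
                if (qName.equals("c")) {
                    String raw = value.toString();
                    // t="s" means the value is an index into the shared-strings table
                    boolean isShared = "s".equals(type) && !raw.isEmpty();
                    sink.cell(ref, isShared ? shared.get(Integer.parseInt(raw)) : raw);
                }
            }
        });
    }

    /** Opens the .xlsx as a zip and streams the (assumed) first sheet. */
    static void readWorkbook(String path, CellSink sink) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            List<String> shared = new ArrayList<>();
            ZipEntry sst = zip.getEntry("xl/sharedStrings.xml");
            if (sst != null) {
                try (InputStream in = zip.getInputStream(sst)) {
                    shared = readSharedStrings(in);
                }
            }
            ZipEntry sheet = zip.getEntry("xl/worksheets/sheet1.xml"); // assumed location
            if (sheet == null) throw new IllegalArgumentException("xl/worksheets/sheet1.xml not found");
            try (InputStream in = zip.getInputStream(sheet)) {
                readSheet(in, shared, sink);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        readWorkbook(args[0], (ref, value) -> System.out.println(ref + " = " + value));
    }
}
```

Run it with the workbook path as the only argument, for example the tst.xlsx file from the question. Because each cell is handed to the callback as soon as its closing tag is seen, memory use is dominated by the shared-strings list rather than by the number of rows.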
