
Apache POI much quicker using HSSF than XSSF - what next?

I've been having some issues parsing .xlsx files with Apache POI - I am getting java.lang.OutOfMemoryError: Java heap space in my deployed app. I'm only processing files under 5MB and around 70,000 rows, so my suspicion from reading a number of other questions is that something is amiss.

As suggested in this comment I decided to run SSPerformanceTest.java with the suggested variables to see if there is anything wrong with my code or setup. The results show a significant difference between HSSF ( .xls ) and XSSF ( .xlsx ):

1) HSSF 50000 50 1: Elapsed 1 seconds

2) SXSSF 50000 50 1: Elapsed 5 seconds

3) XSSF 50000 50 1: Elapsed 15 seconds

The FAQ specifically says:

If you can't run that with 50,000 rows and 50 columns in all of HSSF, XSSF and SXSSF in under 3 seconds (ideally a lot less!), the problem is with your environment.

Next, it says to run XLS2CSV.java, which I have done. Feeding in the XSSF file generated above (with 50000 rows and 50 columns) takes around 15 seconds - the same amount of time it took to write the file.

Is something wrong with my environment, and if so, how do I investigate further?

Stats from VisualVM show the heap usage shooting up to 1.2GB during processing. Surely this is way too high, considering that's an extra gigabyte of heap compared to before processing began?

Note: The heap space exception mentioned above only happens in production (on Google App Engine) and only for .xlsx files; however, the tests mentioned in this question have all been run on my development machine with -Xmx2g . I'm hoping that if I can fix the problem on my development setup, it will use less memory when I deploy.

Stack trace from App Engine:

    Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.xmlbeans.impl.store.Cur.createElementXobj(Cur.java:260)
        at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.startElement(Cur.java:2997)
        at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3211)
        at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
        at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1802)
        at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)

I was facing the same kind of issue reading a bulky .xlsx file using Apache POI, and I came across

excel-streaming-reader-github

This library serves as a wrapper around that streaming API while preserving the syntax of the standard POI API.

This library can help you to read large files.
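
A minimal reading sketch based on that library's README; the builder settings and file path here are illustrative assumptions, not taken from the question:

    import com.monitorjbl.xlsx.StreamingReader;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class StreamingReadExample {
        public static void main(String[] args) throws Exception {
            try (InputStream is = new FileInputStream("/path/to/large.xlsx"); // illustrative path
                 Workbook workbook = StreamingReader.builder()
                         .rowCacheSize(100)  // number of rows kept in memory at a time
                         .bufferSize(4096)   // byte buffer used when reading the stream
                         .open(is)) {
                // Only the cached window of rows is ever held on the heap
                for (Sheet sheet : workbook) {
                    for (Row row : sheet) {
                        for (Cell cell : row) {
                            System.out.print(cell.getStringCellValue() + "\t");
                        }
                        System.out.println();
                    }
                }
            }
        }
    }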

The average XLSX workbook I work with has about 18-22 sheets of 750,000 rows with 13-20 columns. This runs inside a Spring web application with lots of other functionality. I gave the whole application not that much memory: -Xms1024m -Xmx4096m - and it works great!

First of all, the dumping code: it is wrong to load each and every data row into memory and only then start dumping it. In my case (reporting from a PostgreSQL database) I reworked the data dump procedure to use RowCallbackHandler to write to my XLSX; once I reach my limit of 750,000 rows, I create a new sheet. The workbook is created with a visibility window of 50 rows. This way I am able to dump huge volumes: the resulting XLSX file is about 1230MB.

Some code to write sheets:

    jdbcTemplate.query(
        new PreparedStatementCreator() {
            @Override
            public PreparedStatement createPreparedStatement(Connection connection) throws SQLException {
                PreparedStatement statement = connection.prepareStatement(finalQuery, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
                statement.setFetchSize(100);
                statement.setFetchDirection(ResultSet.FETCH_FORWARD);
                return statement;
            }
        }, new RowCallbackHandler() {
            Sheet sheet = null;
            int i = 750000;
            int tableId = 0;

            @Override
            public void processRow(ResultSet resultSet) throws SQLException {
                if (i == 750000) {
                    tableId++;
                    i = 0;
                    sheet = wb.createSheet(sheetName.concat(String.format("%02d", tableId)));

                    Row r = sheet.createRow(0);

                    Cell c = r.createCell(0);
                    c.setCellValue("id");
                    c = r.createCell(1);
                    c.setCellValue("Дата");
                    c = r.createCell(2);
                    c.setCellValue("Комментарий");
                    c = r.createCell(3);
                    c.setCellValue("Сумма операции");
                    c = r.createCell(4);
                    c.setCellValue("Дебет");
                    c = r.createCell(5);
                    c.setCellValue("Страхователь");
                    c = r.createCell(6);
                    c.setCellValue("Серия договора");
                    c = r.createCell(7);
                    c.setCellValue("Номер договора");
                    c = r.createCell(8);
                    c.setCellValue("Основной агент");
                    c = r.createCell(9);
                    c.setCellValue("Кредит");
                    c = r.createCell(10);
                    c.setCellValue("Программа");
                    c = r.createCell(11);
                    c.setCellValue("Дата начала покрытия");
                    c = r.createCell(12);
                    c.setCellValue("Дата планового окончания покрытия");
                    c = r.createCell(13);
                    c.setCellValue("Периодичность уплаты взносов");
                }
                i++;

                PremiumEntity e = PremiumEntity.builder()
                    .Id(resultSet.getString("id"))
                    .OperationDate(resultSet.getDate("operation_date"))
                    .Comments(resultSet.getString("comments"))
                    .SumOperation(resultSet.getBigDecimal("sum_operation").doubleValue())
                    .DebetAccount(resultSet.getString("debet_account"))
                    .Strahovatelname(resultSet.getString("strahovatelname"))
                    .Seria(resultSet.getString("seria"))
                    .NomPolica(resultSet.getLong("nom_polica"))
                    .Agentname(resultSet.getString("agentname"))
                    .CreditAccount(resultSet.getString("credit_account"))
                    .Program(resultSet.getString("program"))
                    .PoliciStartDate(resultSet.getDate("polici_start_date"))
                    .PoliciPlanEndDate(resultSet.getDate("polici_plan_end_date"))
                    .Periodichn(resultSet.getString("id_periodichn"))
                    .build();

                Row r = sheet.createRow(i);
                Cell c = r.createCell(0);
                c.setCellValue(e.getId());

                if (e.getOperationDate() != null) {
                    c = r.createCell(1);
                    c.setCellStyle(dateStyle);
                    c.setCellValue(e.getOperationDate());
                }

                c = r.createCell(2);
                c.setCellValue(e.getComments());

                c = r.createCell(3);
                c.setCellValue(e.getSumOperation());

                c = r.createCell(4);
                c.setCellValue(e.getDebetAccount());

                c = r.createCell(5);
                c.setCellValue(e.getStrahovatelname());

                c = r.createCell(6);
                c.setCellValue(e.getSeria());

                c = r.createCell(7);
                c.setCellValue(e.getNomPolica());

                c = r.createCell(8);
                c.setCellValue(e.getAgentname());

                c = r.createCell(9);
                c.setCellValue(e.getCreditAccount());

                c = r.createCell(10);
                c.setCellValue(e.getProgram());

                if (e.getPoliciStartDate() != null) {
                    c = r.createCell(11);
                    c.setCellStyle(dateStyle);
                    c.setCellValue(e.getPoliciStartDate());
                }

                if (e.getPoliciPlanEndDate() != null) {
                    c = r.createCell(12);
                    c.setCellStyle(dateStyle);
                    c.setCellValue(e.getPoliciPlanEndDate());
                }

                c = r.createCell(13);
                c.setCellValue(e.getPeriodichn());
            }
        });
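
For context, a sketch of the assumed surrounding setup and teardown ( wb , dateStyle and the output path are not shown in the original snippet; the file name here is illustrative):

    // Streaming workbook with a 50-row visibility window: rows beyond the
    // window are flushed to a temp file and no longer held on the heap.
    SXSSFWorkbook wb = new SXSSFWorkbook(50);
    CellStyle dateStyle = wb.createCellStyle();
    dateStyle.setDataFormat(wb.getCreationHelper().createDataFormat().getFormat("dd.mm.yyyy"));

    // ... run the jdbcTemplate.query(...) call shown above ...

    try (FileOutputStream fos = new FileOutputStream("report.xlsx")) { // illustrative path
        wb.write(fos);
    }
    wb.dispose(); // deletes the SXSSF temp files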

After reworking my code for dumping the data to XLSX, I ran into the problem that 64-bit Office is required to open the files. So I needed to split the workbook with its many sheets into separate single-sheet XLSX files to make them readable on an average machine. Again I used small visibility windows and streamed processing, and the whole application kept working well without any sign of OutOfMemory.

Some code to read and split sheets:

        OPCPackage opcPackage = OPCPackage.open(originalFile, PackageAccess.READ);


        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(opcPackage);
        XSSFReader xssfReader = new XSSFReader(opcPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        int index = 0;
        while (iter.hasNext()) {
            InputStream stream = iter.next();
            String sheetName = iter.getSheetName();

            DataFormatter formatter = new DataFormatter();
            InputSource sheetSource = new InputSource(stream);

            SheetToWorkbookSaver saver = new SheetToWorkbookSaver(sheetName);
            try {
                XMLReader sheetParser = SAXHelper.newXMLReader();
                ContentHandler handler = new XSSFSheetXMLHandler(
                    styles, null, strings, saver, formatter, false);
                sheetParser.setContentHandler(handler);
                sheetParser.parse(sheetSource);
            } catch(ParserConfigurationException e) {
                throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
            }

            stream.close();

            // this creates new File descriptors inside storage
            FileDto partFile = new FileDto("report_".concat(StringUtils.trimToEmpty(sheetName)).concat(".xlsx"));
            File cloneFile = fileStorage.read(partFile);
            FileOutputStream cloneFos = new FileOutputStream(cloneFile);
            saver.getWb().write(cloneFos);
            cloneFos.close();
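            saver.getWb().dispose(); // assumed cleanup: deletes the SXSSF temp files behind this part workbook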
        }
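        opcPackage.revert(); // assumed cleanup: releases the read-only package without saving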

and

public class SheetToWorkbookSaver implements XSSFSheetXMLHandler.SheetContentsHandler {

    private SXSSFWorkbook wb;
    private Sheet sheet;
    private CellStyle dateStyle;


    private Row currentRow;

    public SheetToWorkbookSaver(String workbookName) {
        this.wb = new SXSSFWorkbook(50);
        this.dateStyle = this.wb.createCellStyle();
        this.dateStyle.setDataFormat(this.wb.getCreationHelper().createDataFormat().getFormat("dd.mm.yyyy"));

        this.sheet = this.wb.createSheet(workbookName);

    }

    @Override
    public void startRow(int rowNum) {
        this.currentRow = this.sheet.createRow(rowNum);
    }

    @Override
    public void endRow(int rowNum) {

    }

    @Override
    public void cell(String cellReference, String formattedValue, XSSFComment comment) {
        int thisCol = (new CellReference(cellReference)).getCol();
        Cell c = this.currentRow.createCell(thisCol);
        c.setCellValue(formattedValue);
        c.setCellComment(comment);
    }

    @Override
    public void headerFooter(String text, boolean isHeader, String tagName) {

    }


    public SXSSFWorkbook getWb() {
        return wb;
    }
}

So it reads and writes the data. I guess in your case you should rework your code along the same pattern: keep only a small footprint of data in memory. For reading, I would suggest creating a custom SheetContentsHandler that pushes the data to some database, where it can easily be processed, aggregated, etc.
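
A minimal sketch of that idea, assuming Spring's JdbcTemplate; the class name, the staging_rows table and its columns are hypothetical, for illustration only:

    import org.apache.poi.ss.util.CellReference;
    import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
    import org.apache.poi.xssf.usermodel.XSSFComment;
    import org.springframework.jdbc.core.JdbcTemplate;

    import java.util.ArrayList;
    import java.util.List;

    public class SheetToDatabaseHandler implements XSSFSheetXMLHandler.SheetContentsHandler {

        private static final int BATCH_SIZE = 1000;

        private final JdbcTemplate jdbcTemplate;
        private final List<Object[]> batch = new ArrayList<>();
        private String[] currentRow;

        public SheetToDatabaseHandler(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public void startRow(int rowNum) {
            currentRow = new String[3]; // as many cells as the staging table has columns
        }

        @Override
        public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            int col = new CellReference(cellReference).getCol();
            if (col < currentRow.length) {
                currentRow[col] = formattedValue;
            }
        }

        @Override
        public void endRow(int rowNum) {
            batch.add(currentRow);
            if (batch.size() >= BATCH_SIZE) {
                flush(); // keep only a small, bounded number of rows in memory
            }
        }

        @Override
        public void headerFooter(String text, boolean isHeader, String tagName) {
            // not needed for a plain data dump
        }

        public void finish() {
            flush(); // push any remaining rows once the sheet is fully parsed
        }

        private void flush() {
            if (batch.isEmpty()) {
                return;
            }
            // "staging_rows" and its columns are placeholder names
            jdbcTemplate.batchUpdate("INSERT INTO staging_rows (c0, c1, c2) VALUES (?, ?, ?)", batch);
            batch.clear();
        }
    }

The handler plugs into the same XSSFSheetXMLHandler wiring shown above, so the XLSX is never fully materialised in memory.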
