Writing large Excel in Java causing high CPU usage using apache-poi

I am writing a large data set: around half a million records with 25 columns.

I am using the apache-poi streaming workbook to write data from a list to an Excel file. When tested locally, it causes high CPU spikes on the local machine too. The spikes appear to happen while the workbook data is written to the file:

workbook.write(fileOutputStream); // debugged and confirmed: this call causes the CPU spikes

It causes high CPU usage in our cloud app (deployed in Kubernetes), and the application restarts because it hits its resource limits. We have a simple app configured with 2042Mi memory and 1024m CPU.

Is there any way to write a large Excel file efficiently, without such an impact on CPU, memory and the Java heap?

(NOTE: we can't use CSV or other formats, as the business requirement is for Excel files.)

Code used:

import java.io.File;
import java.io.FileOutputStream;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellStyle;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.springframework.stereotype.Service;

import com.king.medicalcollege.model.Medico;

@Service
public class ExcelWriterService {

    // file is an empty file already created
    // Large List around 500K records of medico data [Medico is POJO]

    public File writeData(File file, List<Medico> medicos) {

        SXSSFWorkbook sxssfWorkbook = null;
        try (SXSSFWorkbook workbook = sxssfWorkbook = new SXSSFWorkbook(1);
                FileOutputStream fileOutputStream = new FileOutputStream(file)) {

            Sheet sheet = workbook.createSheet();
            CellStyle cellStyle = workbook.createCellStyle();
            int rowNum = 0;
            for (Medico medico : medicos) {
                Row row = sheet.createRow(rowNum);
                //just adding POJO values (25 fields)  into ROW 
                addDataInRow(medico, row, cellStyle);
                rowNum++;
            }

            //workbook.write causing CPU spike
            workbook.write(fileOutputStream);

            workbook.dispose();

        } catch (Exception exception) {
            return null;
        } finally {
            if (sxssfWorkbook != null) {
                sxssfWorkbook.dispose();
            }
        }

        return file;
    }

    private void addDataInRow(Medico medico, Row row, CellStyle cellStyle) {
        Cell cell_0 = row.createCell(0);
        cell_0.setCellValue(medico.getFirstName());
        cell_0.setCellStyle(cellStyle);
        
        Cell cell_1 = row.createCell(1);
        cell_1.setCellValue(medico.getMiddleName());
        cell_1.setCellStyle(cellStyle);
        
        Cell cell_2 = row.createCell(2);
        cell_2.setCellValue(medico.getLastName());
        cell_2.setCellStyle(cellStyle);
        
        Cell cell_3 = row.createCell(3);
        cell_3.setCellValue(medico.getFirstName());
        cell_3.setCellStyle(cellStyle);
        
        //...... around 25 columns will be added like this
    }
}

The question "Is there any way to write a large Excel file without impacting CPU and memory?" is a bit like asking to have your cake and eat it too. In German we say: wash me, but don't get me wet. In other words: all content that ends up in a file on a computer must pass through the CPU and must be in memory while it is processed.

To get an idea of how much memory we are talking about, let's do a simple calculation of what 500,000 rows with 25 columns means:

                     Cell value                                                                         Length (bytes)  x 500,000 (bytes)  KiB           MiB
Single value         some cell value                                                                    15
25 columns           some cell value, some cell value, some cell value, some cell value, some ...       425             212,500,000        207,519.5313  202.6557922
XML of single value  <c r="C99" t="inlineStr" s="9"><is><t>some cell value</t></is></c>                 66
XML of 25 columns    <c r="C99" t="inlineStr" s="9"><is><t>some cell value</t></is></c><c r="...        1650            825,000,000        805,664.0625  786.781311

That shows that, even as plain text, 500,000 rows with 25 columns each holding the cell value "some cell value" take 202.6557922 MiB of memory.

But an Excel file is not simply plain text. The current Excel format (Office Open XML) stores XML, and that needs much more memory because of the XML overhead. The table above shows that 500,000 rows with 25 columns holding the cell value "some cell value" take 786.781311 MiB when stored as XML.

Those 786.781311 MiB only store the cells; there is more overhead for the rows, the sheets, the workbook, the styles, the relations, ...
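The arithmetic behind these figures can be reproduced with a few lines of Java. This is only a sketch of the estimate: the class and method names are made up for illustration, and it assumes one byte per character (UTF-8, ASCII-only cell values), with the 17-byte plain-text cell including the ", " separator as in the table.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Apache POI: reproduces the size estimate above.
final class CellSizeEstimate {

    // Bytes needed for `rows` rows of `columns` copies of `cell`, encoded as UTF-8.
    static long totalBytes(String cell, int columns, long rows) {
        return (long) cell.getBytes(StandardCharsets.UTF_8).length * columns * rows;
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 17 bytes including the ", " separator (425 bytes / 25 columns in the table)
        String plain = "some cell value, ";
        // 66 bytes: the same value wrapped in the inline-string cell markup
        String xml = "<c r=\"C99\" t=\"inlineStr\" s=\"9\"><is><t>some cell value</t></is></c>";

        long rows = 500_000;
        int columns = 25;
        System.out.printf("plain text: %.2f MiB%n", toMiB(totalBytes(plain, columns, rows)));
        System.out.printf("cell XML:   %.2f MiB%n", toMiB(totalBytes(xml, columns, rows)));
    }
}
```

Real data will differ, of course: longer strings, numeric cells and shared strings all change the per-cell cost, so treat the result as an order of magnitude only.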

SXSSFWorkbook claims to be a streaming approach. But it only streams the cells of rows in sheets into temporary files. It additionally needs memory to hold the workbook, the styles, the relations, ... After streaming, it needs memory to put all of that together into the workbook. And at least during this step, the whole workbook has to pass through CPU and memory.
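To get a feeling for why that final step costs CPU even when memory stays flat, here is a rough sketch using only java.util.zip. It relies on the fact that an .xlsx file is a ZIP container of XML parts, and it assumes that compressing the sheet XML is one plausible contributor to the CPU load; it does not reproduce what Apache POI does internally. The class name, the entry name and the simplified cell markup are illustrative, and the output is not a valid workbook.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative sketch only: streams sheet-like XML through a ZIP stream.
final class ZipStreamingSketch {

    /** Counts bytes and discards them, so the sketch needs no disk space. */
    static final class CountingSink extends OutputStream {
        long count;
        @Override public void write(int b) { count++; }
        @Override public void write(byte[] b, int off, int len) { count += len; }
    }

    /** Streams rows x columns inline-string cells into one ZIP entry; returns {rawBytes, zippedBytes}. */
    static long[] writeRows(int rows, int columns) {
        CountingSink sink = new CountingSink();
        long raw = 0;
        try (ZipOutputStream zip = new ZipOutputStream(sink)) {
            zip.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
            for (int r = 1; r <= rows; r++) {
                // Only one row is held in memory at a time.
                StringBuilder row = new StringBuilder("<row r=\"").append(r).append("\">");
                for (int c = 0; c < columns; c++) {
                    row.append("<c t=\"inlineStr\"><is><t>some cell value</t></is></c>");
                }
                byte[] bytes = row.append("</row>").toString().getBytes(StandardCharsets.UTF_8);
                zip.write(bytes);          // every byte goes through the deflater here
                raw += bytes.length;
            }
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] { raw, sink.count };
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long[] sizes = writeRows(500_000, 25);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("raw XML bytes:    " + sizes[0]);
        System.out.println("zipped bytes:     " + sizes[1]);
        System.out.println("elapsed (millis): " + millis);
    }
}
```

Heap usage stays small because each row is discarded after it is written, but all of the raw XML still has to be generated and compressed, which keeps one core busy for the duration of the write.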

Conclusion: Excel is a spreadsheet application. It is not a good format for data exchange. Good formats for data exchange are plain text (CSV), plain XML or JSON, as those really can contain streams of plain data rows without the overhead of sheets in a workbook.

If your business requirement insists on Excel files, then give the application more resources.

My code provided in "Get smart Excel table with Apache POI" works with organizations-1000000.csv, downloaded from https://www.datablist.com/learn/csv/download-sample-csv-files . It creates an Excel.xlsx with 1,000,000 data rows and a size of 97,450 KB.

For me, java -XX:+PrintFlagsFinal -version | findstr HeapSize prints:

   size_t ErgoHeapSizeLimit                        = 0                                         {product} {default}
   size_t HeapSizePerGCThread                      = 43620760                                  {product} {default}
   size_t InitialHeapSize                          = 132120576                                 {product} {ergonomic}
   size_t LargePageHeapSizeThreshold               = 134217728                                 {product} {default}
   size_t MaxHeapSize                              = 2113929216                                {product} {ergonomic}
   size_t MinHeapSize                              = 8388608                                   {product} {ergonomic}
    uintx NonNMethodCodeHeapSize                   = 5832780                                {pd product} {ergonomic}
    uintx NonProfiledCodeHeapSize                  = 122912730                              {pd product} {ergonomic}
    uintx ProfiledCodeHeapSize                     = 122912730                              {pd product} {ergonomic}
   size_t SoftMaxHeapSize                          = 2113929216                             {manageable} {ergonomic}
java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)

You seem to be doing the right thing by giving SXSSFWorkbook a window size (although 1 might be too small and cause problems?). The workbook should get flushed to disk when the number of rows exceeds the limit you set, which reduces memory usage. I doubt there is a workaround for reducing CPU usage, though.

You can try to limit your memory usage by adjusting JVM parameters so that it doesn't hit the K8s limit. Have a look at these: -Xmx -Xms -XX:MaxRAM -XX:+UseSerialGC
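Before tuning those flags, it may help to check what the JVM actually sees inside the container. The following minimal sketch prints the effective heap ceiling and CPU count using only java.lang.Runtime; the class name is made up, and the flag values in the comment are illustrative assumptions, not recommendations.

```java
// Hypothetical diagnostic class: reports the limits the running JVM has picked up.
// Example launch (values are assumptions for illustration):
//   java -Xms256m -Xmx1g -XX:+UseSerialGC JvmLimits.java
final class JvmLimits {

    // Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx or the ergonomic default).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // Number of processors available to the JVM (container CPU limits can reduce this).
    static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):       " + maxHeapMiB());
        System.out.println("available processors: " + cpus());
    }
}
```

If the reported heap is close to the container's memory limit, there is little room left for non-heap memory, so a lower -Xmx may be what keeps the pod from being restarted.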

Have you considered using an alternative library for writing Excel files? For example, have a look at this SO question: "Are there any alternatives to using Apache POI Java for Microsoft Office?"
