Apache PDFBox: java.lang.OutOfMemoryError when using split(PDDocument document)

Question

I am trying to split a 300-page document using the Apache PDFBox API v2.0.2. When I try to split the PDF file into single pages with the following code:

        PDDocument document = PDDocument.load(inputFile);
        Splitter splitter = new Splitter();
        List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I get the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

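This indicates that the GC is spending a large amount of time clearing the heap, and that this time is not justified by the amount of memory it reclaims.

There are many JVM tuning options that could work around this situation, but all of them only treat the symptom rather than the real problem.

One final note: I am using JDK 6, so the new Java 8 Consumer cannot be used in my case.

Edit:

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2, for the following reasons: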

 1. I do not have the size problem mentioned in the aforementioned
    topic. I am slicing a 270-page, 13.8 MB PDF file, and after slicing
    each slice averages 80 KB, for a total size of 30.7 MB.
 2. The split throws the exception even before it returns the split parts.

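I found that the split succeeds as long as I do not pass the whole document at once, but instead pass it in "batches" of 20-30 pages each; that gets the job done.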

Answer 1

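PDFBox keeps the parts produced by the split operation on the heap as objects of type PDDocument, which fills the heap quickly. Even if you call close() after every round of the loop, the GC still cannot reclaim heap space as fast as it is being filled.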
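
To watch the heap fill up, you could log heap usage around each split call. The following is only a minimal diagnostic sketch built on the standard java.lang.Runtime API; the class HeapLog and its methods are hypothetical names, not part of PDFBox, and the reported numbers are approximate.

```java
// Hypothetical diagnostic helper; uses only the standard java.lang.Runtime API.
final class HeapLog {

    private HeapLog() {
    }

    // Approximate heap currently in use, in megabytes.
    static long usedMegabytes() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    // Maximum heap the JVM will attempt to use (roughly the -Xmx setting), in megabytes.
    static long maxMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // One-line summary suitable for printing before and after a split call.
    static String snapshot(String label) {
        return label + ": used " + usedMegabytes() + " MB of max " + maxMegabytes() + " MB";
    }

    public static void main(String[] args) {
        System.out.println(snapshot("before split"));
        // ... call splitter.split(document) here ...
        System.out.println(snapshot("after split"));
    }
}
```

If the "used" figure keeps climbing toward the maximum while the split runs, that is consistent with the explanation above.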

One option is to break the split operation into several batches, where each batch is a relatively manageable chunk (10 to 40 pages):

import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int totalPages = document.getNumberOfPages();
        int batchSize = 50;
        // Splitter page numbers are 1-based and the end page is inclusive,
        // so consecutive batches must not share a boundary page
        for (int start = 1, batch = 1; start <= totalPages; start += batchSize, batch++) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch + " start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    // Config is the poster's own application class, not part of PDFBox
    String outputPath = Config.INSTANCE.getProperty("outputPath");

    for (int index = 0; index < splittedDocuments.size(); index++) {
        // start + index is the 1-based page number in the source document,
        // which keeps file names unique across batches
        String fileName = document.getDocumentInformation().getTitle()
                + "-" + (start + index) + ".pdf";
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            splittedDocument.save(new File(outputPath, fileName));
        } finally {
            // release each single-page document as soon as it is written
            splittedDocument.close();
        }
    }
}
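
The page-range arithmetic behind the batching is easy to get wrong by one, so it may help to check it separately from PDFBox. The following is a minimal sketch, assuming that the Splitter start and end pages are 1-based and that the end page is inclusive; PageBatches and ranges are hypothetical names used only for illustration. It sticks to Java 6 syntax because of the JDK 6 constraint in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): computes 1-based, inclusive page ranges.
final class PageBatches {

    private PageBatches() {
    }

    // Returns {start, end} pairs covering pages 1..totalPages,
    // each pair spanning at most batchSize pages.
    static List<int[]> ranges(int totalPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> result = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            result.add(new int[] { start, end });
        }
        return result;
    }

    public static void main(String[] args) {
        // 270 pages in batches of 50, the figures mentioned in this thread
        for (int[] range : ranges(270, 50)) {
            System.out.println("start: " + range[0] + " end: " + range[1]);
        }
    }
}
```

Each pair could then be passed to setStartPage and setEndPage. Computing the end as start + batchSize - 1 keeps consecutive batches from sharing a page, and the Math.min call handles a shorter final batch without a separate code path.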
