Apache PDFBox: java.lang.OutOfMemoryError when using split(PDDocument document)

Question

I am trying to split a document of about 300 pages using the Apache PDFBox API, version 2.0.2. I split the PDF file into single pages with the following code:

        PDDocument document = PDDocument.load(inputFile);
        Splitter splitter = new Splitter();
        List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I receive the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

This indicates that the GC is spending a large amount of time clearing the heap, and that the amount of memory reclaimed does not justify that time.

There are numerous JVM tuning options that could work around this; however, all of them treat the symptom rather than the real issue.

One final note: I am using JDK 6, so the new Java 8 Consumer is not an option in my case. Thanks.

Edit:

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because:

 1. I do not have the size problem mentioned in that question. I am
    slicing a 270-page, 13.8MB PDF file, and after slicing each slice
    averages 80KB, with a total size of 30.7MB.
 2. The split throws the exception before it even returns the split parts.

I found that the split succeeds as long as I do not pass the whole document at once. Instead, I pass it in batches of 20-30 pages each, which does the job.

Answer 1

PDFBox stores the parts resulting from the split operation on the heap as objects of type PDDocument, so the heap fills up quickly. Even if you call close() after every round in the loop, the GC cannot reclaim heap space as fast as it is being filled.

One option is to divide the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int batchSize = 50;
        int totalPages = document.getNumberOfPages();
        int batch = 1;
        // Splitter page numbers are 1-based and the end page is inclusive,
        // so consecutive batches must not share a boundary page.
        for (int start = 1; start <= totalPages; start += batchSize) {
            // the last batch may be shorter than batchSize
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch + " start: " + start + " end: " + end);
            split(document, start, end);
            batch++;
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document to release its resources
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    // Config is the application's own settings class (not shown here)
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    String title = document.getDocumentInformation().getTitle();

    for (int index = 0; index < splittedDocuments.size(); index++) {
        // name each file after the source page it contains
        File outputFile = new File(outputPath, title + "-" + (start + index) + ".pdf");
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            splittedDocument.save(outputFile);
        } finally {
            // close each part once it is saved so that a finished batch can be reclaimed
            splittedDocument.close();
        }
    }
}
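
Because Splitter.setStartPage and Splitter.setEndPage take 1-based page numbers and the end page is included in the result, batch boundaries are easy to get off by one. The following is only a minimal sketch of the range arithmetic, with no PDFBox dependency so it can be checked in isolation. BatchRanges and pageRanges are hypothetical names introduced here for illustration; they are not part of PDFBox. The code avoids Java 7+ syntax so it should also compile on JDK 6.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes inclusive, 1-based page ranges.
class BatchRanges {

    // Returns {start, end} pairs that cover pages 1..totalPages with no gaps or overlaps.
    static List<int[]> pageRanges(int totalPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> ranges = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // the last range is cut short if totalPages is not a multiple of batchSize
            int end = Math.min(start + batchSize - 1, totalPages);
            ranges.add(new int[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 270 pages is the document size mentioned in the question.
        for (int[] range : pageRanges(270, 50)) {
            System.out.println("start: " + range[0] + " end: " + range[1]);
        }
    }
}
```

For a 270-page document and a batch size of 50 this yields six ranges, the last one covering pages 251 to 270. Each pair could then be passed to setStartPage and setEndPage before calling split.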

Answer 1: accepted, 2016-07-10 17:23:28
