
How to merge many PDFs into one PDF file

I want to ask how to merge more than 100k PDF files (each around 160 KB) into 1 PDF file.

Tutorial

I already read this tutorial, and that code works for a few PDFs. But when I tried it with 10k PDF files, I got this error: "java.lang.OutOfMemoryError: GC overhead limit exceeded".

I already tried using -Xmx and -Xms; the error then becomes "Java heap space".

I am also calling "pdf.flushCopiedObjects(firstSourcePdf);", but it doesn't help. Or maybe I am using it incorrectly?

import java.io.File;
import java.util.Arrays;

import com.itextpdf.kernel.pdf.*;
import com.itextpdf.kernel.utils.PdfMerger;
import com.itextpdf.pdfa.PdfADocument;

// pathName: directory containing the source PDFs; dest: path of the merged output
File file = new File(pathName);
File[] listFile = file.listFiles();
if (listFile == null) {
    throw new Exception("File not found at " + pathName);
}
// Sort the whole array (the original upper bound of length - 1 skipped the last file)
Arrays.sort(listFile);

PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
    PdfAConformanceLevel.PDF_A_1A,
    // PDF/A output normally needs an InputStream with the ICC profile as the last argument
    new PdfOutputIntent("Custom", "", "http://www.color.org",
        "sRGB IEC61966-2.1", null));

// Setting some parameters required for PDF/A-1a conformance
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
    new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/A-1a example");

// Create PdfMerger instance
PdfMerger merger = new PdfMerger(pdf);

// Merge every source document into the target, flushing after each one
for (File filePdf : listFile) {
    System.out.println("filePdf = " + filePdf.getName());
    PdfDocument firstSourcePdf = new PdfDocument(new PdfReader(filePdf));
    merger.merge(firstSourcePdf, 1, firstSourcePdf.getNumberOfPages());
    pdf.flushCopiedObjects(firstSourcePdf);
    firstSourcePdf.close();
}

pdf.close();

Thank you.

This is a known issue when merging a large number of PDF documents (or large PDFs).

iText will try to make the resulting PDF as small as possible. It does this by trying to reuse objects. For instance, if you have an image that occurs multiple times, instead of embedding that image every time, it will embed it once and simply use a reference for the other occurrences.

That means iText has to keep all objects in memory, because there is no way of knowing beforehand whether an object will get reused.

A solution that usually helps is splitting the process into batches. Instead of merging 1000 files into 1, try merging the 1000 files in pairs (resulting in 500 documents), then merging each of those in pairs (resulting in 250 documents), and so on.

That allows iText to flush its buffers regularly, which should keep the memory overhead from crashing the JVM.
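Here is a minimal sketch of that pairwise scheme in iText 7, assuming the intermediate rounds use a plain PdfDocument (the PDF/A setup from the question would only be applied when writing the final document); the class name, the mergeInPairs helper and the temp-file handling are illustrative, not from the tutorial:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.utils.PdfMerger;

public class PairwiseMerge {

    // Repeatedly merges the files in pairs until a single file remains.
    // Each round halves the number of documents, so no single PdfDocument
    // ever has to accumulate objects from more than two sources.
    static File mergeInPairs(List<File> files) throws IOException {
        List<File> current = new ArrayList<>(files);
        while (current.size() > 1) {
            List<File> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += 2) {
                if (i + 1 == current.size()) {
                    next.add(current.get(i)); // odd file out: carry it to the next round
                    continue;
                }
                File out = Files.createTempFile("merge-", ".pdf").toFile();
                try (PdfDocument target = new PdfDocument(new PdfWriter(out))) {
                    PdfMerger merger = new PdfMerger(target);
                    for (File f : Arrays.asList(current.get(i), current.get(i + 1))) {
                        try (PdfDocument src = new PdfDocument(new PdfReader(f))) {
                            merger.merge(src, 1, src.getNumberOfPages());
                            target.flushCopiedObjects(src); // release copied objects early
                        }
                    }
                }
                next.add(out);
            }
            current = next;
        }
        return current.get(0);
    }
}

With 100k inputs, strict pairs mean roughly 17 rounds and a lot of temporary files; grouping a few hundred sources per round instead of two keeps the same idea with far fewer intermediate documents.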

If it doesn't have to be iText, you could try a command-line application that supports merging files. PDFtk, QPDF and HexaPDF CLI (note: I'm the author of HexaPDF) are some CLI tools that support basic PDF file merging.
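For example, per those tools' documented merge syntax, PDFtk concatenates with "pdftk in1.pdf in2.pdf cat output merged.pdf" and QPDF with "qpdf --empty --pages in1.pdf in2.pdf -- merged.pdf" (the in*.pdf and merged.pdf names are placeholders).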
