
Error Merging Large PDF Files with PDFBox - Missing end of file marker '%%EOF'

I have successfully implemented a PDF merge solution with PDFBox using InputStreams. However, when I try to merge a very large document, I receive the following error:

Caused by: java.io.IOException: Missing root object specification in trailer.
at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2832) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1060) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:379) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280) ~[pdfbox-2.0.11.jar:2.0.11]

Of more importance (I think) are these statements that occur just before the error:

FINE (pdfparser.COSParser) [] - Missing end of file marker '%%EOF'
FINE (pdfparser.COSParser) [] - Set missing offset 388 for object 2 0 R

It seems to me that it can't find the '%%EOF' marker in very large files. Now I know that it is indeed there because I can look at the source (unfortunately I can't provide the file itself).

Doing some searching online I found that there is a setEOFLookupRange() method on the COSParser class. I'm wondering if perhaps the lookup range is too small and that is why it can't find the '%%EOF' marker. The problem is...I'm not using the COSParser object at all in my code; I'm only using the PDFMergerUtility class. The PDFMergerUtility seems to be using the COSParser under the hood.

So my questions are

  1. Is my hypothesis about the EOFLookupRange correct?
  2. If so, how can I set that range only having the PDFMergerUtility in my code and not the COSParser object?

Many thanks for your time!

UPDATED with code below

    private boolean getCoolDocuments(final String slateId, final String filePathAndName)
            throws IOException {

        boolean status = false;
        InputStream pdfStream = null;
        HttpURLConnection connection = null;
        final PDFMergerUtility merger = new PDFMergerUtility();
        final ByteArrayOutputStream mergedPdfOutputStream = new ByteArrayOutputStream();

        try {

            final List<SlateDocument> parsedSlateDocuments = this.getSpecificDocumentsFromSlate(slateId);

            if (!parsedSlateDocuments.isEmpty()) {

                // iterate through each document, adding each pdf stream to the merger utility
                int numberOfDocuments = 0;
                for (final SlateDocument slateDocument : parsedSlateDocuments) {

                    final String url = this.getBaseURL() + "/slate/" + slateId + "/documents/"
                            + slateDocument.getDocumentId();

                    /* code for RequestResponseUtil.initializeRequest(...) below */
                    connection = RequestResponseUtil.initializeRequest(url, "GET", this.getAuthenticationHeader(),
                            true, MediaType.APPLICATION_PDF_VALUE);

                    if (RequestResponseUtil.isSuccessful(connection.getResponseCode())) {
                        pdfStream = connection.getInputStream();

                    }
                    else {
                        /* do various things */
                    }

                    merger.addSource(pdfStream);
                    numberOfDocuments++;
                }

                merger.setDestinationStream(mergedPdfOutputStream);

                // merge all the pdf streams together
                merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());

                status = true;
            }
            else {
                LOG.severe("An error occurred while parsing the slated documents; no documents remain after parsing!");
            }
        }
        finally {
            RequestResponseUtil.close(pdfStream);

            this.disconnect(connection);
        }

        return status;
    }

    public static HttpURLConnection initializeRequest(final String url, final String method,
            final String httpAuthHeader, final boolean multiPartFormData, final String responseType) {

        HttpURLConnection conn = null;

        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod(method);
            conn.setRequestProperty("X-Slater-Authentication", httpAuthHeader);
            conn.setRequestProperty("Accept", responseType);
            if (multiPartFormData) {
                conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=BOUNDARY");
                conn.setDoOutput(true);
            }
            else {
                conn.setRequestProperty("Content-Type", "application/xml");
            }
        }
        catch (final MalformedURLException e) {
            throw new CustomException(e);
        }
        catch (final IOException e) {
            throw new CustomException(e);
        }
        return conn;
    }

As I suspected, this was an issue with the InputStream. It wasn't exactly what I thought, but basically I was making the (very wrong) assumption that I could just do this:

           pdfStream = connection.getInputStream();
                /* ... */
           merger.addSource(pdfStream);

Of course, that's not reliable, because there is no guarantee that the entire InputStream gets read. It needs to be read explicitly until read() returns -1, which signals the end of the stream. I'm pretty sure that on the smaller files this was working fine and actually reading in the entire stream, but on the larger files it simply wasn't making it to the end, hence not finding the %%EOF marker.
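One way to confirm this kind of truncation is to look for the marker in the tail of the bytes that actually arrived, rather than in the file on the server. The following is only a sketch: the class and method names are made up for illustration, and the 2048-byte range is simply the default lookup range mentioned elsewhere in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EofMarkerCheck {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns true if "%%EOF" starts somewhere within the last lookupRange bytes of data. */
    static boolean hasEofMarker(byte[] data, int lookupRange) {
        int start = Math.max(0, data.length - lookupRange);
        // Scan backwards from the last position where the marker could still fit.
        for (int i = data.length - EOF_MARKER.length; i >= start; i--) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + EOF_MARKER.length), EOF_MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy data standing in for a downloaded PDF (not a valid PDF).
        byte[] complete = "%PDF-1.4\n...\nstartxref\n388\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        // Simulate a download that stopped 10 bytes early.
        byte[] truncated = Arrays.copyOf(complete, complete.length - 10);

        System.out.println("complete has marker:  " + hasEofMarker(complete, 2048));
        System.out.println("truncated has marker: " + hasEofMarker(truncated, 2048));
    }
}
```

If the check fails on the bytes your code received but the marker is visible in the original file, the data was cut short before it reached the parser.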

The solution was to use an intermediary ByteArrayOutputStream and then convert that back to an InputStream via a ByteArrayInputStream.

So if you replace this line of code:

pdfStream = connection.getInputStream();

above with this code:

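                final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();

                // fetch the response stream once, then read until read() returns -1 (end of stream)
                final InputStream responseStream = connection.getInputStream();
                int c;
                while ((c = responseStream.read()) != -1) {
                    byteArrayOutputStream.write(c);
                }

                pdfStream = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());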

you'll end up with a working example.
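The loop above copies one byte per call, which can be slow for very large responses. A buffered variant might look like the sketch below. The class name, the helper name readFully and the 8192-byte buffer size are arbitrary choices for illustration, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamBuffering {
    /** Reads the stream until read() reports -1 and returns everything as a byte array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int bytesRead;
        // read(byte[]) may return fewer bytes than the buffer holds, so only
        // the first bytesRead bytes are written on each pass.
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = new byte[100_000];
        try (InputStream in = new ByteArrayInputStream(original)) {
            byte[] copy = readFully(in);
            System.out.println("copied all bytes: " + (copy.length == original.length));
        }
    }
}
```

In the method from the question this would be used as pdfStream = new ByteArrayInputStream(readFully(connection.getInputStream())). On Java 9 or later, InputStream.readAllBytes() does the same job without a helper.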

I may end up changing this implementation to use Pipes or Circular Buffers instead, but at least this is working for now.
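Another option, not the one I used, is to spool each download to a temporary file instead of holding it in memory, since very large PDFs can make the byte array approach expensive. This is only a sketch under that assumption; the class name, method name and file prefix are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TempFileBuffering {
    /** Copies the whole stream into a new temporary file and returns its path. */
    static Path copyToTempFile(InputStream in) throws IOException {
        Path tempFile = Files.createTempFile("merge-source-", ".pdf");
        // createTempFile already created an empty file, so it has to be replaced.
        Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = new byte[50_000]; // stands in for a downloaded PDF
        Path tempFile = copyToTempFile(new ByteArrayInputStream(sample));
        System.out.println("bytes on disk: " + Files.size(tempFile));
        // With PDFBox on the classpath, the file could then be handed to the
        // merger, e.g. merger.addSource(tempFile.toFile()).
        Files.delete(tempFile);
    }
}
```

The temporary files need to be deleted after the merge has finished, not before.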

While this wasn't necessarily a Java 101 mistake, it was more like a Java 102 mistake and is still shameful. :/ Hopefully it will help someone else.

Thanks to @Tilman Hausherr and @Master_ex for all their help!

I took a look at the code and found out that the default EOFLookupRange in COSParser is 2048 bytes.

I think that your assumption is valid.

Looking at the PDFParser, which extends the COSParser and is the parser used internally by the PDFMergerUtility, I see that it is possible to set another EOFLookupRange by using a system property. The system property name is org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange and its value should be a valid integer.

Here is a question demonstrating how to set system properties.
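As a rough sketch, the property could be set programmatically before the merger is created. The class name, helper name and the value 8192 are arbitrary illustrations, and I am assuming the parser picks the property up when it is initialized.

```java
class EofLookupRangeProperty {
    // Property name as quoted above.
    static final String EOF_LOOKUP_RANGE_PROPERTY =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange";

    /** Stores the lookup range (in bytes) as a JVM-wide system property. */
    static void setEofLookupRange(int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("lookup range must be positive: " + bytes);
        }
        System.setProperty(EOF_LOOKUP_RANGE_PROPERTY, Integer.toString(bytes));
    }

    public static void main(String[] args) {
        // 8192 is an illustrative value, not a recommendation.
        setEofLookupRange(8192);
        System.out.println(System.getProperty(EOF_LOOKUP_RANGE_PROPERTY));
        // Create the PDFMergerUtility and call mergeDocuments(...) after this point.
    }
}
```

The same property can be passed on the command line with -Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange=8192 when starting the JVM.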

I haven't tested the above but I hope it will work :)

The links to the PDFBox code use the 2.0.11 version which is the one that you are using.


 