Error merging large PDF files with PDFBox: missing end-of-file marker '%%EOF'

Question

I have successfully implemented a PDF merge solution with PDFBox, using InputStreams. However, when I try to merge a very large document I receive the following error:

Caused by: java.io.IOException: Missing root object specification in trailer.
at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2832) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1060) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:379) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280) ~[pdfbox-2.0.11.jar:2.0.11]

Of more importance (I think) are these statements, which are logged just before the error:

FINE (pdfparser.COSParser) [] - Missing end of file marker '%%EOF'
FINE (pdfparser.COSParser) [] - Set missing offset 388 for object 2 0 R

It seems to me that it can't find the '%%EOF' marker in very large files. I know the marker is indeed there because I can look at the source (unfortunately I can't provide the file itself).

Searching online, I found that there is a setEOFLookupRange() method on the COSParser class. I'm wondering whether the lookup range is too small and that is why it can't find the '%%EOF' marker. The problem is that I'm not using the COSParser object at all in my code; I'm only using the PDFMergerUtility class. The PDFMergerUtility seems to use the COSParser under the hood.

So my questions are:

Is my hypothesis about the EOFLookupRange correct?
If so, how can I set that range when my code only has the PDFMergerUtility and not the COSParser object?

Many thanks for your time!

UPDATED with code below

 private boolean getCoolDocuments(final String slateId, final String filePathAndName)
            throws IOException {

        boolean status = false;
        InputStream pdfStream = null;
        HttpURLConnection connection = null;
        final PDFMergerUtility merger = new PDFMergerUtility();
        final ByteArrayOutputStream mergedPdfOutputStream = new ByteArrayOutputStream();

        try {

            final List<SlateDocument> parsedSlateDocuments = this.getSpecificDocumentsFromSlate(slateId);

            if (!parsedSlateDocuments.isEmpty()) {

                // iterate through each document, adding each pdf stream to the merger utility
                int numberOfDocuments = 0;
                for (final SlateDocument slateDocument : parsedSlateDocuments) {

                    final String url = this.getBaseURL() + "/slate/" + slateId + "/documents/"
                            + slateDocument.getDocumentId();

                     /* code for RequestResponseUtil.initializeRequest(...) below */
                    connection = RequestResponseUtil.initializeRequest(url, "GET", this.getAuthenticationHeader(),
                            true, MediaType.APPLICATION_PDF_VALUE);

                    if (RequestResponseUtil.isSuccessful(connection.getResponseCode())) {
                        pdfStream = connection.getInputStream();

                    }
                    else {
                        /* do various things */
                    }

                    merger.addSource(pdfStream);
                    numberOfDocuments++;
                }

                merger.setDestinationStream(mergedPdfOutputStream);

                // merge all the pdf streams together
                merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());

                status = true;
            }
            else {
                LOG.severe("An error occurred while parsing the slated documents; no documents remain after parsing!");
            }
        }
        finally {
            RequestResponseUtil.close(pdfStream);

            this.disconnect(connection);
        }

        return status;
    }

    public static HttpURLConnection initializeRequest(final String url, final String method,
            final String httpAuthHeader, final boolean multiPartFormData, final String responseType) {

        HttpURLConnection conn = null;

        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod(method);
            conn.setRequestProperty("X-Slater-Authentication", httpAuthHeader);
            conn.setRequestProperty("Accept", responseType);
            if (multiPartFormData) {
                conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=BOUNDARY");
                conn.setDoOutput(true);
            }
            else {
                conn.setRequestProperty("Content-Type", "application/xml");
            }
        }
        catch (final MalformedURLException e) {
            throw new CustomException(e);
        }
        catch (final IOException e) {
            throw new CustomException(e);
        }
        return conn;
    }

Answer 1

As I suspected, this was an issue with the InputStream. It wasn't exactly what I thought, but basically I was making the (very wrong) assumption that I could just do this:

           pdfStream = connection.getInputStream();
                /* ... */
           merger.addSource(pdfStream);

Of course, that's not going to work because the entire InputStream may or may not be read. It needs to be read explicitly until read() returns -1, which signals the end of the stream. I'm pretty sure that on the smaller files this was working fine and the entire stream was actually read, but on the larger files it simply wasn't making it to the end, hence not finding the %%EOF marker.

The solution was to use an intermediary ByteArrayOutputStream and then convert that back to an InputStream via a ByteArrayInputStream.

So if you replace this line of code:

pdfStream = connection.getInputStream();

above with this code:

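                final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();

                // look the response stream up once, then read until read() returns -1 (end of stream)
                final InputStream responseStream = connection.getInputStream();
                int c;
                while ((c = responseStream.read()) != -1) {
                    byteArrayOutputStream.write(c);
                }

                // hand PDFBox a fully buffered, in-memory copy of the response
                pdfStream = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());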

you'll end up with a working example.
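The read-until-end loop can also be pulled into a small helper that reads in chunks instead of one byte at a time, which tends to matter for large responses. This is only a sketch using the JDK; the names `StreamBuffering`, `readFully` and `buffered` are made up for illustration and are not part of PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of PDFBox.
final class StreamBuffering {

    /** Reads the stream until read() reports end of stream (-1) and returns all bytes. */
    static byte[] readFully(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] chunk = new byte[8192];
        int n;
        // read(byte[]) may return fewer bytes than the buffer holds; only -1 means "done"
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns a fully buffered, in-memory copy of the source stream. */
    static InputStream buffered(final InputStream in) throws IOException {
        return new ByteArrayInputStream(readFully(in));
    }
}
```

With a helper like this, the replacement would shrink to `pdfStream = StreamBuffering.buffered(connection.getInputStream());`. On Java 9 and later, `InputStream.readAllBytes()` does the same job as `readFully`.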

I may end up changing this implementation to use pipes or circular buffers instead, but at least this is working for now.
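Buffering every source PDF in a byte array keeps all of them in memory until the merge runs, which may be a concern for very large files. This is neither pipes nor circular buffers, but one simpler alternative is to spool each response to a temporary file. The sketch below uses only the JDK; `TempFileSpooler` and `spoolToTempFile` are hypothetical names, and the file prefix is arbitrary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, not part of PDFBox.
final class TempFileSpooler {

    /** Copies the whole stream into a new temporary file and returns its path. */
    static Path spoolToTempFile(final InputStream in) throws IOException {
        final Path tmp = Files.createTempFile("merge-source-", ".pdf");
        // Files.copy reads until end of stream; REPLACE_EXISTING is required
        // because createTempFile has already created the (empty) file
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

The resulting file could then be passed to the merger with `merger.addSource(path.toFile())`, assuming the `addSource(File)` overload of PDFMergerUtility. The temporary files would need to be deleted once the merge has finished.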

While this wasn't necessarily a Java 101 mistake, it was more like a Java 102 mistake and is still shameful. :/ Hopefully it will help someone else.

Thanks to @Tilman Hausherr and @Master_ex for all their help!

Answer 2

I took a look at the code and found that the default EOFLookupRange in COSParser is 2048 bytes.

I think that your assumption is valid.
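One way to test the hypothesis on your own data is to check whether '%%EOF' appears within the last N bytes of what was actually downloaded. The sketch below is a simplified illustration of that idea, not PDFBox's parser code; `EofMarkerCheck` and `hasEofMarker` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of PDFBox.
final class EofMarkerCheck {

    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if "%%EOF" occurs entirely within the last lookupRange bytes of data. */
    static boolean hasEofMarker(final byte[] data, final int lookupRange) {
        final int start = Math.max(0, data.length - lookupRange);
        // scan backwards from the last position where the marker could still fit
        for (int i = data.length - EOF_MARKER.length; i >= start; i--) {
            if (matchesAt(data, i)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesAt(final byte[] data, final int pos) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[pos + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

If the check fails for the downloaded bytes even with a very large range, the data was probably truncated before it reached the parser, and a larger lookup range would not help.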

Looking at PDFParser, which extends COSParser and is the parser used internally by PDFMergerUtility, I see that it is possible to set another EOFLookupRange through a system property. The system property name is org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange and its value should be a valid integer.

Here is a question demonstrating how to set system properties.
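As a rough sketch, the property could be set programmatically like this. The property name is the one quoted above; the class name `EofLookupRangeConfig`, the method name and the value 8192 are only illustrative, and like the rest of this answer it is untested against PDFBox itself.

```java
// Hypothetical configuration helper, not part of PDFBox.
final class EofLookupRangeConfig {

    // Property name as quoted above for PDFBox 2.0.11.
    static final String EOF_LOOKUP_RANGE_PROPERTY =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange";

    /** Stores the lookup range (in bytes) as a JVM-wide system property. */
    static void setEofLookupRange(final int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("lookup range must be positive: " + bytes);
        }
        System.setProperty(EOF_LOOKUP_RANGE_PROPERTY, Integer.toString(bytes));
    }

    public static void main(final String[] args) {
        // 8192 is an arbitrary example value, not a recommendation
        setEofLookupRange(8192);
        System.out.println(System.getProperty(EOF_LOOKUP_RANGE_PROPERTY));
    }
}
```

The call presumably has to happen before `mergeDocuments(...)` loads the source documents. The same property can be passed on the command line instead, as `-Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange=8192`.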

I haven't tested the above but I hope it will work :)

The links to the PDFBox code use the 2.0.11 version, which is the one that you are using.

2 answers

Answer 1: 2, accepted, 2018-08-01 02:03:57

Answer 2: 0, 2018-07-27 21:09:40
