Error Merging Large PDF Files with PDFBox - Missing end of file marker '%%EOF'
I have successfully implemented a PDF merge solution using PDFBox and InputStreams. However, when I try to merge a very large document, I receive the following error:
Caused by: java.io.IOException: Missing root object specification in trailer.
at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2832) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1060) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:379) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280) ~[pdfbox-2.0.11.jar:2.0.11]
Of more importance (I think) are these statements that occur just before the error:
FINE (pdfparser.COSParser) [] - Missing end of file marker '%%EOF'
FINE (pdfparser.COSParser) [] - Set missing offset 388 for object 2 0 R
It seems to me that it can't find the '%%EOF' marker in very large files. I know that the marker is indeed there because I can inspect the file's contents (unfortunately I can't provide the file itself).
While searching online I found that there is a setEOFLookupRange() method on the COSParser class. I'm wondering whether the lookup range is too small and that is why it can't find the '%%EOF' marker. The problem is that I'm not using the COSParser object at all in my code; I'm only using the PDFMergerUtility class. PDFMergerUtility seems to use COSParser under the hood.
So my questions are:
1. Is my assumption about the EOFLookupRange correct?
2. If so, how can I set the lookup range, given that I only have PDFMergerUtility in my code and not the COSParser object?
Many thanks for your time!
private boolean getCoolDocuments(final String slateId, final String filePathAndName)
        throws IOException {
    boolean status = false;
    InputStream pdfStream = null;
    HttpURLConnection connection = null;
    final PDFMergerUtility merger = new PDFMergerUtility();
    final ByteArrayOutputStream mergedPdfOutputStream = new ByteArrayOutputStream();
    try {
        final List<SlateDocument> parsedSlateDocuments = this.getSpecificDocumentsFromSlate(slateId);
        if (!parsedSlateDocuments.isEmpty()) {
            // iterate through each document, adding each pdf stream to the merger utility
            int numberOfDocuments = 0;
            for (final SlateDocument slateDocument : parsedSlateDocuments) {
                final String url = this.getBaseURL() + "/slate/" + slateId + "/documents/"
                        + slateDocument.getDocumentId();
                /* code for RequestResponseUtil.initializeRequest(...) below */
                connection = RequestResponseUtil.initializeRequest(url, "GET", this.getAuthenticationHeader(),
                        true, MediaType.APPLICATION_PDF_VALUE);
                if (RequestResponseUtil.isSuccessful(connection.getResponseCode())) {
                    pdfStream = connection.getInputStream();
                }
                else {
                    /* do various things */
                }
                merger.addSource(pdfStream);
                numberOfDocuments++;
            }
            merger.setDestinationStream(mergedPdfOutputStream);
            // merge the all the pdf streams together
            merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
            status = true;
        }
        else {
            LOG.severe("An error occurred while parsing the slated documents; no documents remain after parsing!");
        }
    }
    finally {
        RequestResponseUtil.close(pdfStream);
        this.disconnect(connection);
    }
    return status;
}
public static HttpURLConnection initializeRequest(final String url, final String method,
        final String httpAuthHeader, final boolean multiPartFormData, final String reponseType) {
    HttpURLConnection conn = null;
    try {
        conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod(method);
        conn.setRequestProperty("X-Slater-Authentication", httpAuthHeader);
        conn.setRequestProperty("Accept", reponseType);
        if (multiPartFormData) {
            conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=BOUNDARY");
            conn.setDoOutput(true);
        }
        else {
            conn.setRequestProperty("Content-Type", "application/xml");
        }
    }
    catch (final MalformedURLException e) {
        throw new CustomException(e);
    }
    catch (final IOException e) {
        throw new CustomException(e);
    }
    return conn;
}
As I suspected, this was an issue with the InputStream. It wasn't exactly what I thought, but basically I was making the (very wrong) assumption that I could just do this:
pdfStream = connection.getInputStream();
/* ... */
merger.addSource(pdfStream);
Of course, that's not going to work because the entire InputStream may or may not be read. It needs to be read explicitly until read() returns -1, which signals the end of the stream. I'm pretty sure that on the smaller files this was working fine and actually reading in the entire stream, but on the larger files it simply wasn't making it to the end, hence not finding the %%EOF marker.
The solution was to use an intermediary ByteArrayOutputStream and then convert that back to an InputStream via a ByteArrayInputStream.
So if you replace this line of code:
pdfStream = connection.getInputStream();
above with this code:
final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
int c;
while ((c = connection.getInputStream().read()) != -1) {
    byteArrayOutputStream.write(c);
}
pdfStream = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());
you'll end up with a working example.
I may end up changing this implementation to use pipes or circular buffers instead, but at least this is working for now.
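The loop above reads one byte per call, which can be slow for very large PDFs. A minimal sketch of the same idea with a chunked read is shown below; the class name StreamBuffering, the method names, and the 8192-byte chunk size are illustrative assumptions and not part of PDFBox or of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not a PDFBox class): buffers a whole stream in memory.
final class StreamBuffering {
    private StreamBuffering() {}

    /** Reads the stream until read() returns -1 and returns every byte that was read. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192]; // arbitrary chunk size chosen for this sketch
        int n;
        while ((n = in.read(chunk)) != -1) {
            // Only the first n bytes of the chunk are valid on each pass.
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Wraps the fully buffered bytes in a new in-memory InputStream. */
    static InputStream toBufferedStream(InputStream in) throws IOException {
        return new ByteArrayInputStream(readFully(in));
    }
}
```

With such a helper, the replaced line would presumably become pdfStream = StreamBuffering.toBufferedStream(connection.getInputStream());. Note that this still holds each complete PDF in memory, just like the byte-by-byte version.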
While this wasn't necessarily a Java 101 mistake, it was more like a Java 102 mistake and is still shameful. :/ Hopefully it will help someone else.
Thanks to @Tilman Hausherr and @Master_ex for all their help!
I took a look at the code and found that the default EOFLookupRange in COSParser is 2048 bytes.
I think that your assumption is valid.
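To illustrate what a lookup range means in practice, here is a rough sketch that searches only the last N bytes of a file for the '%%EOF' marker. It is not PDFBox's actual implementation; the class EofMarkerCheck and its method are hypothetical, and only the 2048-byte default comes from the text above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper (not part of PDFBox).
final class EofMarkerCheck {
    // Default range quoted above for COSParser in PDFBox 2.0.11.
    static final int DEFAULT_LOOKUP_RANGE = 2048;

    private EofMarkerCheck() {}

    /** Returns true if "%%EOF" starts within the last lookupRange bytes of data. */
    static boolean hasEofMarker(byte[] data, int lookupRange) {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        int start = Math.max(0, data.length - lookupRange);
        // Scan backwards from the last position where the marker could still fit.
        for (int i = data.length - marker.length; i >= start; i--) {
            if (matchesAt(data, i, marker)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] marker) {
        for (int j = 0; j < marker.length; j++) {
            if (data[offset + j] != marker[j]) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this can help distinguish the two possible causes: a truncated download has no marker anywhere near its end, whereas a complete file with a lot of trailing data has a marker that lies outside the range.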
Looking at PDFParser, which extends COSParser and is the parser used internally by PDFMergerUtility, I see that it is possible to set another EOFLookupRange by using a system property. The system property name is org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange and its value should be a valid integer.
Here is a question demonstrating how to set system properties.
I haven't tested the above, but I hope it will work :)
The links to the PDFBox code use the 2.0.11 version, which is the one that you are using.
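As a hedged sketch of how the property might be set programmatically: the property name is the one quoted above, while the class EofLookupRangeConfig, its method, and the value 8192 are illustrative assumptions. Like the suggestion itself, this has not been tested against PDFBox, and it would presumably need to run before PDFMergerUtility creates its parser.

```java
// Hypothetical configuration helper (not part of PDFBox).
final class EofLookupRangeConfig {
    // Property name quoted in the answer above.
    static final String EOF_LOOKUP_RANGE_PROPERTY =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange";

    private EofLookupRangeConfig() {}

    /** Stores the desired lookup range (in bytes) as a JVM system property. */
    static void setEofLookupRange(int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("lookup range must be positive: " + bytes);
        }
        System.setProperty(EOF_LOOKUP_RANGE_PROPERTY, Integer.toString(bytes));
    }

    public static void main(String[] args) {
        setEofLookupRange(8192); // 8192 is an arbitrary example value
        System.out.println(System.getProperty(EOF_LOOKUP_RANGE_PROPERTY));
    }
}
```

The same property can also be passed on the command line with the standard JVM flag -Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange=8192, which avoids any ordering concerns inside the application.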