I have successfully implemented a PDF merge solution using PDFBox and InputStreams. However, when I try to merge a very large document, I receive the following error:
Caused by: java.io.IOException: Missing root object specification in trailer.
at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2832) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1060) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:379) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280) ~[pdfbox-2.0.11.jar:2.0.11]
Of more importance (I think) are these statements that occur just before the error:
FINE (pdfparser.COSParser) [] - Missing end of file marker '%%EOF'
FINE (pdfparser.COSParser) [] - Set missing offset 388 for object 2 0 R
It seems to me that it can't find the '%%EOF' marker in very large files. I know that it is indeed there because I can look at the source (unfortunately I can't provide the file itself).
Doing some searching online, I found that there is a setEOFLookupRange() method on the COSParser class. I'm wondering if perhaps the lookup range is too small and that is why it can't find the '%%EOF' marker. The problem is that I'm not using the COSParser object at all in my code; I'm only using the PDFMergerUtility class. The PDFMergerUtility seems to be using the COSParser under the hood.
So my questions are:
1. Is my assumption about the EOFLookupRange correct?
2. If so, how can I set it, given that I only have the PDFMergerUtility in my code and not the COSParser object?
Many thanks for your time!
private boolean getCoolDocuments(final String slateId, final String filePathAndName)
        throws IOException {
    boolean status = false;
    InputStream pdfStream = null;
    HttpURLConnection connection = null;
    final PDFMergerUtility merger = new PDFMergerUtility();
    final ByteArrayOutputStream mergedPdfOutputStream = new ByteArrayOutputStream();
    try {
        final List<SlateDocument> parsedSlateDocuments = this.getSpecificDocumentsFromSlate(slateId);
        if (!parsedSlateDocuments.isEmpty()) {
            // iterate through each document, adding each pdf stream to the merger utility
            int numberOfDocuments = 0;
            for (final SlateDocument slateDocument : parsedSlateDocuments) {
                final String url = this.getBaseURL() + "/slate/" + slateId + "/documents/"
                        + slateDocument.getDocumentId();
                /* code for RequestResponseUtil.initializeRequest(...) below */
                connection = RequestResponseUtil.initializeRequest(url, "GET", this.getAuthenticationHeader(),
                        true, MediaType.APPLICATION_PDF_VALUE);
                if (RequestResponseUtil.isSuccessful(connection.getResponseCode())) {
                    pdfStream = connection.getInputStream();
                }
                else {
                    /* do various things */
                }
                merger.addSource(pdfStream);
                numberOfDocuments++;
            }
            merger.setDestinationStream(mergedPdfOutputStream);
            // merge all the pdf streams together
            merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
            status = true;
        }
        else {
            LOG.severe("An error occurred while parsing the slated documents; no documents remain after parsing!");
        }
    }
    finally {
        RequestResponseUtil.close(pdfStream);
        this.disconnect(connection);
    }
    return status;
}
public static HttpURLConnection initializeRequest(final String url, final String method,
        final String httpAuthHeader, final boolean multiPartFormData, final String reponseType) {
    HttpURLConnection conn = null;
    try {
        conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod(method);
        conn.setRequestProperty("X-Slater-Authentication", httpAuthHeader);
        conn.setRequestProperty("Accept", reponseType);
        if (multiPartFormData) {
            conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=BOUNDARY");
            conn.setDoOutput(true);
        }
        else {
            conn.setRequestProperty("Content-Type", "application/xml");
        }
    }
    catch (final MalformedURLException e) {
        throw new CustomException(e);
    }
    catch (final IOException e) {
        throw new CustomException(e);
    }
    return conn;
}
As I suspected, this was an issue with the InputStream. It wasn't exactly what I thought, but basically I was making the (very wrong) assumption that I could just do this:
pdfStream = connection.getInputStream();
/* ... */
merger.addSource(pdfStream);
Of course, that's not going to work because the entire InputStream may or may not be read. It needs to be read explicitly until read() returns -1, which signals the end of the stream. I'm pretty sure that on the smaller files this was working fine and actually reading in the entire stream, but on the larger files it simply wasn't making it to the end, hence not finding the %%EOF marker.
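To illustrate the general InputStream contract behind this (a single read call may return fewer bytes than requested), here is a minimal sketch. The class ShortReadDemo and the throttled stream are hypothetical and only simulate a slow source; they are not taken from PDFBox or from the code above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ShortReadDemo {

    /** Hypothetical stream that returns at most `chunk` bytes per read call (chunk must be > 0). */
    static InputStream throttled(final byte[] data, final int chunk) {
        if (chunk <= 0) {
            throw new IllegalArgumentException("chunk must be positive");
        }
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(final byte[] b, final int off, final int len) {
                return super.read(b, off, Math.min(len, chunk));
            }
        };
    }

    /** Issues exactly one read call and reports how many bytes it delivered. */
    static int singleRead(final InputStream in, final int capacity) throws IOException {
        final byte[] buf = new byte[capacity];
        return in.read(buf, 0, buf.length);
    }

    /** Keeps reading until the buffer is full or read() returns -1. */
    static int readUntilEnd(final InputStream in, final int capacity) throws IOException {
        final byte[] buf = new byte[capacity];
        int total = 0;
        int n;
        while (total < capacity && (n = in.read(buf, total, capacity - total)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] data = new byte[10];
        System.out.println(singleRead(throttled(data, 4), 10));   // prints 4
        System.out.println(readUntilEnd(throttled(data, 4), 10)); // prints 10
    }
}
```

Whether this is exactly what happened inside the merge call is not confirmed here; the sketch only shows why looping until -1 is the safe pattern.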
The solution was to use an intermediary ByteArrayOutputStream and then convert that back to an InputStream via a ByteArrayInputStream.
So if you replace this line of code:
pdfStream = connection.getInputStream();
above with this code:
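final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
// fetch the stream once and copy it in chunks rather than one byte at a time
final InputStream responseStream = connection.getInputStream();
final byte[] buffer = new byte[8192];
int bytesRead;
// keep reading until read() returns -1 (end of stream)
while ((bytesRead = responseStream.read(buffer)) != -1) {
    byteArrayOutputStream.write(buffer, 0, bytesRead);
}
pdfStream = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());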
you'll end up with a working example.
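As a rough diagnostic, the buffered bytes could also be checked for the %%EOF marker before they are handed to the merger. The sketch below is an assumption-laden illustration: the class PdfStreamBuffer and its methods are hypothetical names, and the 2048-byte range used in main simply mirrors the default lookup range mentioned elsewhere in this thread. On Java 9 or later, InputStream.readAllBytes() could replace the manual loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class PdfStreamBuffer {

    /** Reads the stream in chunks until read() returns -1 and returns all bytes. */
    static byte[] readFully(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns true if "%%EOF" occurs within the last `lookupRange` bytes of `data`. */
    static boolean hasEofMarker(final byte[] data, final int lookupRange) {
        // clamp the range so it is never negative or longer than the data
        final int range = Math.max(0, Math.min(lookupRange, data.length));
        final String tail = new String(data, data.length - range, range, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(final String[] args) throws IOException {
        // stand-in bytes, not a real PDF
        final byte[] fake = "%PDF-1.4\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        final byte[] copy = readFully(new ByteArrayInputStream(fake));
        System.out.println(hasEofMarker(copy, 2048));
    }
}
```

A false result for a downloaded document would suggest the bytes were truncated before PDFBox ever saw them, which is cheaper to detect here than from the parser's stack trace.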
I may end up changing this implementation to use pipes or circular buffers instead, but at least this is working for now.
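Another option for very large documents, sketched here under assumptions, is to spool each response to a temporary file rather than holding it in memory. The class TempFileSpool and the "merge-source-" prefix are hypothetical; only standard java.nio.file calls are used. PDFMergerUtility in PDFBox 2.0.x should also accept a File through addSource, but that part is untested here and appears only as a comment.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TempFileSpool {

    /** Copies the whole stream into a new temp file; the caller is responsible for deleting it. */
    static Path spoolToTempFile(final InputStream in) throws IOException {
        final Path tmp = Files.createTempFile("merge-source-", ".pdf");
        try {
            // Files.copy reads until the end of the stream
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (final IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(final String[] args) throws IOException {
        final byte[] fake = new byte[100_000]; // stand-in for a downloaded document
        final Path tmp = spoolToTempFile(new ByteArrayInputStream(fake));
        try {
            System.out.println(Files.size(tmp)); // prints 100000
            // Assumption: with PDFBox on the classpath, the file could then be
            // added to the merger, e.g. merger.addSource(tmp.toFile());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

This keeps heap usage flat regardless of document size, at the cost of extra disk I/O and the need to clean up the temp files after the merge.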
While this wasn't necessarily a Java 101 mistake, it was more like a Java 102 mistake and is still shameful. :/ Hopefully it will help someone else.
Thanks to @Tilman Hausherr and @Master_ex for all their help!
I took a look at the code and found out that the default EOFLookupRange in COSParser is 2048 bytes.
I think that your assumption is valid.
Looking at the PDFParser, which extends the COSParser and is the parser used internally by the PDFMergerUtility, I see that it is possible to set another EOFLookupRange by using a system property. The system property name is org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange and its value should be a valid integer.
Here is a question demonstrating how to set system properties.
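For illustration, a minimal sketch of setting that property programmatically follows. The class EofLookupRangeConfig and the value 8192 are hypothetical choices; only the property name comes from the PDFBox code mentioned above. Presumably the property has to be set before PDFBox creates its parser, so it should run before the merge.

```java
final class EofLookupRangeConfig {

    // Property name as quoted above for PDFBox 2.0.11.
    static final String EOF_LOOKUP_RANGE_PROPERTY =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange";

    /** Stores the lookup range (in bytes) as a JVM system property. */
    static void setEofLookupRange(final int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("lookup range must be positive");
        }
        System.setProperty(EOF_LOOKUP_RANGE_PROPERTY, Integer.toString(bytes));
    }

    public static void main(final String[] args) {
        setEofLookupRange(8192); // 8192 is an arbitrary illustrative value
        // read the value back the same way any Java code could
        System.out.println(Integer.getInteger(EOF_LOOKUP_RANGE_PROPERTY));
        // ... PDFMergerUtility calls would follow here ...
    }
}
```

The same value could instead be passed on the command line when starting the JVM, in the form -Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange=8192.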
I haven't tested the above but I hope it will work :)
The links to the PDFBox code use the 2.0.11 version which is the one that you are using.