pdfbox Load - java.net.SocketException: Connection reset --PDDocument.load()

Question

I am trying to merge multiple PDF files using the PDFBox utility. The code below works fine when I have a small number of files, but when merging 3000+ files I get a connection reset exception at source = PDDocument.load(is);

I tried debugging, but without much luck.

    for (allinputfiles)
    {
        AmazonS3URI s3URI = new AmazonS3URI(fileToBeDownloaded);
        S3Object s3Object = s3Client.getObject(s3URI.getBucket(), s3URI.getKey());
        S3ObjectInputStream s3InputStream = s3Object.getObjectContent();
        MyS3ObjectStream.add(s3InputStream);
    }
    return MyS3ObjectStream;

    PDDocument destination = new PDDocument();
    PDDocument source;
    for (MyS3ObjectStream s3fileobj : MyS3ObjectStream) {
        PDFMergerUtility pdfMerger = new PDFMergerUtility();
        pdfMerger.setDestinationFileName(MergedFile.pdf);

        try (InputStream is = s3fileobj.getS3ObjectInputStream())
        {
            source = PDDocument.load(is);
        }
        catch (Exception e)
        {
            LOG.error("Error in Loading PDF file for conversion");
            continue;
        }

        try
        {
            pdfMerger.appendDocument(destination, source);
            destination.save(MergedFile.pdf);
        }
        catch (Exception e)
        {
            LOG.error("Error in PDF Append method , response added to failed file.");
            continue;
        }
        finally
        {
            source.close();
        }
    }

The aim is to merge 3000+ documents using PDFBox.

Exception

Error in Loading PDF file for conversion
java.net.SocketException: Connection reset
    at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:323)
    at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:350)
    at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:803)
    at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
    at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
    at java.base/sun.security.ssl.SSLSocketInputRecord.readFully(SSLSocketInputRecord.java:461)
    at java.base/sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:243)
    at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:181)
    at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:111)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1508)
    at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1479)
    at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1064)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
    at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
    at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
    at java.base/java.security.DigestInputStream.read(DigestInputStream.java:162)
    at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
    at com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:106)
    at org.apache.pdfbox.io.ScratchFile.createBuffer(ScratchFile.java:443)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1228)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1130)

Answer 1

Your try/catch/finally logic is broken: you only close resources in the finally block of the second try/catch.
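To illustrate the point about closing resources, here is a minimal sketch using only the standard library. The Tracked class and the resource names are hypothetical stand-ins (they are not PDFBox or AWS types); the sketch only shows that try-with-resources closes every declared resource, in reverse order, whether or not the body throws.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderDemo {
    // Hypothetical resource that records when it is opened and closed.
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Opens two resources, optionally fails while using them,
    // and returns the sequence of events that took place.
    static List<String> run(boolean failWhileProcessing) {
        List<String> log = new ArrayList<>();
        try (Tracked stream = new Tracked("stream", log);
             Tracked document = new Tracked("document", log)) {
            if (failWhileProcessing) {
                throw new IllegalStateException("simulated failure");
            }
            log.add("processed");
        } catch (IllegalStateException e) {
            // Both resources are already closed when this handler runs.
            log.add("caught");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

In both runs the document is closed before the stream, and on failure both are closed before the catch block executes, so no separate finally block is needed.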

Answer 2

I am sorry that the bounty ended without anybody giving you a complete answer. I will try to do my best now.

The problem you are experiencing is related to the fact that your PDF merge process takes some time.

On the other hand, you open all the input streams from the S3 objects at the beginning of your code and only close them at the end of your processing.

Because of the large number of files you are handling and the time the computation requires, AWS eventually closes some of the underlying network connections.
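The effect of the two opening strategies can be sketched without any AWS dependency. In the sketch below, openSource is a hypothetical stand-in for obtaining an S3 object stream, and the in-memory stream it returns is an assumption made only so the example runs anywhere. It counts how many streams are open at the same time in each approach.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LazyOpenDemo {
    private int open; // streams currently open
    private int peak; // highest number of streams open at once

    // Hypothetical stand-in for s3Client.getObject(...).getObjectContent().
    InputStream openSource(int id) {
        open++;
        peak = Math.max(peak, open);
        return new ByteArrayInputStream(new byte[] {(byte) id}) {
            private boolean closed;

            @Override
            public void close() throws IOException {
                if (!closed) {
                    closed = true;
                    open--;
                }
                super.close();
            }
        };
    }

    // Mirrors the question: open everything first, consume later.
    static int peakWhenOpenedUpFront(int n) {
        LazyOpenDemo demo = new LazyOpenDemo();
        List<InputStream> streams = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            streams.add(demo.openSource(i));
        }
        for (InputStream pending : streams) {
            try (InputStream in = pending) {
                in.read();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return demo.peak;
    }

    // Open each stream only when it is about to be consumed.
    static int peakWhenOpenedOnDemand(int n) {
        LazyOpenDemo demo = new LazyOpenDemo();
        for (int i = 0; i < n; i++) {
            try (InputStream in = demo.openSource(i)) {
                in.read();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return demo.peak;
    }

    public static void main(String[] args) {
        System.out.println(peakWhenOpenedUpFront(3000));
        System.out.println(peakWhenOpenedOnDemand(3000));
    }
}
```

With real network streams, every stream that sits open but unread is an idle connection that the remote side may reset, so keeping the number of simultaneously open streams at one removes the waiting time in which that can happen.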

There are probably other options, but please try something like the following:

    PDDocument destination = new PDDocument();
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    pdfMerger.setDestinationFileName(MergedFile.pdf);

    for (allinputfiles) {
      AmazonS3URI s3URI = new AmazonS3URI(fileToBeDownloaded);
      S3Object s3Object = s3Client.getObject(s3URI.getBucket(), s3URI.getKey());

      // Both objects, s3InputStream and source, will be closed automatically
      try (
        S3ObjectInputStream s3InputStream = s3Object.getObjectContent();
        PDDocument source = PDDocument.load(s3InputStream);
      ) {
        pdfMerger.appendDocument(destination, source);
      } catch (Exception e) {
        // Log as much context as you can, with the sole restriction of
        // not exposing sensitive information in the output
        LOG.error("Error in PDF Append method , response added to failed file while processing '" + fileToBeDownloaded + "'", e);
        continue;
      }

      // I am not sure about the location of this line of code.
      // I moved it, but perhaps the location in your original code
      // is fine or even better. Please test it.
      destination.save(MergedFile.pdf);
    }

I haven't tested the actual code, so apologies if I made a mistake. I hope it helps in some way.

Having said that, as I tried to explain in my comments, be aware that this process could be computationally very expensive, and that the resulting PDF could be unwieldy for both a human and a machine.
