简体   繁体   中英

Batching multiple files to Amazon S3 using the Java SDK

I'm trying to upload multiple files to Amazon S3 all under the same key, by appending the files. I have a list of file names and want to upload/append the files in that order. I am pretty much exactly following this tutorial but I am looping through each file first and uploading that in part. Because the files are on hdfs (the Path is actually org.apache.hadoop.fs.Path), I am using the input stream to send the file data. Some pseudocode is below (I am commenting the blocks that are word for word from the tutorial):

// Create a list of UploadPartResponse objects. You get one of these for
// each part upload.
List<PartETag> partETags = new ArrayList<PartETag>();

// Step 1: Initialize.
InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(
        bk.getBucket(), bk.getKey());
InitiateMultipartUploadResult initResponse =
        s3Client.initiateMultipartUpload(initRequest);
try {
      int i = 1; // part number
      for (String file : files) {
        Path filePath = new Path(file);

        // Get the input stream and content length
        long contentLength = fss.get(branch).getFileStatus(filePath).getLen();
        InputStream is = fss.get(branch).open(filePath);

        long filePosition = 0;
        while (filePosition < contentLength) {
            // create request
            //upload part and add response to our list
            i++;
        }
    }
    // Step 3: Complete.
    CompleteMultipartUploadRequest compRequest = new
          CompleteMultipartUploadRequest(bk.getBucket(),
          bk.getKey(),
          initResponse.getUploadId(),
          partETags);

    s3Client.completeMultipartUpload(compRequest);
} catch (Exception e) {
      //...
}

However, I am getting the following error:

com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 2C1126E838F65BB9), S3 Extended Request ID: QmpybmrqepaNtTVxWRM1g2w/fYW+8DPrDwUEK1XeorNKtnUKbnJeVM6qmeNcrPwc
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1109)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:741)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:461)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:296)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3743)
    at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2617)

If anyone knows what the cause of this error might be, that would be greatly appreciated. Alternatively, if there is a better way to concatenate a bunch of files into one s3 key, that would be great as well. I tried using java's builtin SequenceInputStream but that did not work. Any help would be greatly appreciated. For reference, the total size of all the files could be as large as 10-15 gb.

I know it's probably a bit late but worth giving my contribution. I've managed to solve a similar problem using the SequenceInputStream .

The tricks is in being able to calculate the total size of the result file and then feeding the SequenceInputStream with an Enumeration<InputStream> .

Here's some example code that might help:

public void combineFiles() {
    List<String> files = getFiles();
    long totalFileSize = files.stream()
                               .map(this::getContentLength)
                               .reduce(0L, (f, s) -> f + s);

    try {
        try (InputStream partialFile = new SequenceInputStream(getInputStreamEnumeration(files))) {
            ObjectMetadata resultFileMetadata = new ObjectMetadata();
            resultFileMetadata.setContentLength(totalFileSize);
            s3Client.putObject("bucketName", "resultFilePath", partialFile, resultFileMetadata);
        }
    } catch (IOException e) {
        LOG.error("An error occurred while combining files. {}", e);
    }
}

private Enumeration<? extends InputStream> getInputStreamEnumeration(List<String> files) {
    return new Enumeration<InputStream>() {
        private Iterator<String> fileNamesIterator = files.iterator();

        @Override
        public boolean hasMoreElements() {
            return fileNamesIterator.hasNext();
        }

        @Override
        public InputStream nextElement() {
            try {
                return new FileInputStream(Paths.get(fileNamesIterator.next()).toFile());
            } catch (FileNotFoundException e) {
                System.err.println(e.getMessage());
                throw new RuntimeException(e);
            }
        }

    };
}

Hope this helps!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM