
How to read file chunk by chunk from S3 using aws-java-sdk

I am trying to read a large file from S3 in chunks, without cutting any line in half, for parallel processing.

Let me explain by example: there is a 1 GB file on S3, and I want to divide it into 64 MB chunks. That part is easy:

S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
InputStream stream = s3object.getObjectContent();

byte[] content = new byte[64 * 1024 * 1024];
int bytesRead;
while ((bytesRead = stream.read(content)) != -1) {
    // process the first bytesRead bytes of content here
    // (read() may fill less than the whole array)
}

The problem is that a chunk may contain, say, 100 complete lines and one incomplete one. I cannot process an incomplete line, and I don't want to discard it.

Is there any way to handle this situation, so that every chunk contains only whole lines?
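One way to get line-aligned chunks (a sketch, not part of the AWS SDK): read fixed-size blocks, cut each block at its last newline, and carry the leftover bytes over into the next chunk. The class and method names below are hypothetical, and an in-memory stream stands in for the S3 object content:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineAlignedChunker {
    /**
     * Reads the stream in blocks of roughly chunkSize bytes, but moves any
     * trailing partial line into the next chunk, so every chunk ends on '\n'
     * (except possibly the very last one). Note: a single line longer than
     * chunkSize makes the carry-over grow accordingly.
     */
    public static List<String> chunks(InputStream in, int chunkSize) throws IOException {
        List<String> result = new ArrayList<>();
        byte[] block = new byte[chunkSize];
        byte[] carry = new byte[0]; // partial line left over from the previous block

        int read;
        while ((read = in.read(block)) != -1) {
            byte[] data = new byte[carry.length + read];
            System.arraycopy(carry, 0, data, 0, carry.length);
            System.arraycopy(block, 0, data, carry.length, read);

            // find the last newline; everything after it is carried over
            int cut = data.length - 1;
            while (cut >= 0 && data[cut] != '\n') {
                cut--;
            }

            if (cut < 0) { // no newline at all: carry the whole block forward
                carry = data;
            } else {
                result.add(new String(data, 0, cut + 1, StandardCharsets.UTF_8));
                carry = new byte[data.length - cut - 1];
                System.arraycopy(data, cut + 1, carry, 0, carry.length);
            }
        }

        if (carry.length > 0) { // flush the final partial line
            result.add(new String(carry, StandardCharsets.UTF_8));
        }

        return result;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "aa\nbbbb\ncc\ndddd".getBytes(StandardCharsets.UTF_8));
        // 6-byte blocks, but every chunk except the last ends on a newline
        System.out.println(chunks(in, 6)); // [aa\n, bbbb\ncc\n, dddd]
    }
}
```

The same helper works against `s3object.getObjectContent()`, since it only depends on the `InputStream` interface.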

My usual approach ( InputStream -> BufferedReader.lines() -> batches of lines -> CompletableFuture ) won't work here, because for huge files the underlying S3ObjectInputStream eventually times out.

So I created a new class, S3InputStream, which doesn't care how long it stays open and reads byte blocks on demand using short-lived AWS SDK calls. You provide a byte[] that will be reused; new byte[1 << 24] (16 MB) appears to work well.

package org.harrison;

import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

/**
 * An {@link InputStream} for S3 files that does not care how big the file is.
 *
 * @author stephen harrison
 */
public class S3InputStream extends InputStream {
    private static class LazyHolder {
        private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
    }

    private final String bucket;
    private final String file;
    private final byte[] buffer;
    private long lastByteOffset;

    private long offset = 0;
    private int next = 0;
    private int length = 0;

    public S3InputStream(final String bucket, final String file, final byte[] buffer) {
        this.bucket = bucket;
        this.file = file;
        this.buffer = buffer;
        // use the full content length; subtracting 1 here would drop the final
        // byte whenever the last ranged read starts exactly at the last offset
        this.lastByteOffset = LazyHolder.S3.getObjectMetadata(bucket, file).getContentLength();
    }

    @Override
    public int read() throws IOException {
        if (next >= length) {
            fill();

            if (length <= 0) {
                return -1;
            }

            next = 0;
        }

        if (next >= length) {
            return -1;
        }

        // mask to 0..255 so bytes >= 0x80 are not returned as negative values
        // (a raw (byte) 0xFF would read as -1 and be mistaken for end-of-stream)
        return buffer[this.next++] & 0xFF;
    }

    public void fill() throws IOException {
        if (offset >= lastByteOffset) {
            length = -1;
        } else {
            try (final InputStream inputStream = s3Object()) {
                length = 0;
                int b;

                while ((b = inputStream.read()) != -1) {
                    buffer[length++] = (byte) b;
                }

                if (length > 0) {
                    offset += length;
                }
            }
        }
    }

    private InputStream s3Object() {
        final GetObjectRequest request = new GetObjectRequest(bucket, file).withRange(offset,
                offset + buffer.length - 1);

        return LazyHolder.S3.getObject(request).getObjectContent();
    }
}
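The ranged-refill pattern above can be exercised without S3 by substituting an in-memory "object" for the ranged GET. The sketch below (hypothetical names) mirrors the fill()/read() logic, including the & 0xFF mask, and shows that reassembling the stream reproduces the original bytes:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Mimics S3InputStream, but "ranged GETs" come from an in-memory byte[]. */
public class RangedInputStream extends InputStream {
    private final byte[] object; // stands in for the S3 object
    private final byte[] buffer; // reused block buffer
    private long offset = 0;
    private int next = 0;
    private int length = 0;

    public RangedInputStream(byte[] object, byte[] buffer) {
        this.object = object;
        this.buffer = buffer;
    }

    @Override
    public int read() throws IOException {
        if (next >= length) {
            fill();
            if (length <= 0) {
                return -1;
            }
            next = 0;
        }
        return buffer[next++] & 0xFF; // mask so bytes >= 0x80 are not negative
    }

    private void fill() {
        if (offset >= object.length) {
            length = -1; // past the end: signal EOF
        } else {
            // the "ranged GET": copy [offset, offset + buffer.length), capped at object end
            length = (int) Math.min(buffer.length, object.length - offset);
            System.arraycopy(object, (int) offset, buffer, 0, length);
            offset += length;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i; // includes values >= 0x80 to exercise the mask
        }
        byte[] out = new RangedInputStream(data, new byte[64]).readAllBytes();
        System.out.println(Arrays.equals(data, out)); // true
    }
}
```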

The aws-java-sdk already provides streaming functionality for your S3 objects: call getObject, and the result gives you an InputStream.

1) AmazonS3Client.getObject(GetObjectRequest getObjectRequest) -> S3Object

2) S3Object.getObjectContent()

Note: The method is a simple getter and does not actually create a stream. If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents aren't buffered in memory and stream directly from Amazon S3. Further, failure to close this stream can cause the request pool to become blocked.

(from the AWS Java SDK docs)

100 complete line and one incomplete

Do you mean you need to read the stream line by line? If so, instead of reading raw chunks from the InputStream, wrap the S3 object stream in a BufferedReader so that you can read it line by line, though I think this will be a little slower than reading by chunk.

        S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
        BufferedReader in = new BufferedReader(new InputStreamReader(s3object.getObjectContent()));
        String line;
        while ((line = in.readLine()) != null) {
            // process line here
        }

The answer by @stephen-harrison works well. I updated it for v2 of the SDK, with a couple of tweaks: the connection can now be authorized, and the LazyHolder class is no longer static -- I couldn't figure out how to authorize the connection and still keep the class static.

For another approach using Scala, see https://alexwlchan.net/2019/09/streaming-large-s3-objects/

package foo.whatever;

import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

/**
 * Adapted for AWS Java SDK v2 by jomofrodo@gmail.com
 *
 * An {@link InputStream} for S3 files that does not care how big the file is.
 *
 * @author stephen harrison
 */
public class S3InputStreamV2 extends InputStream {
    private class LazyHolder {
        String appID;
        String secretKey;
        Region region = Region.US_WEST_1;
        public S3Client S3 = null;

        public void connect() {
            AwsBasicCredentials awsCreds = AwsBasicCredentials.create(appID, secretKey);
            S3 = S3Client.builder().region(region).credentialsProvider(StaticCredentialsProvider.create(awsCreds))
                    .build();
        }

        private HeadObjectResponse getHead(String keyName, String bucketName) {
            HeadObjectRequest objectRequest = HeadObjectRequest.builder().key(keyName).bucket(bucketName).build();

            return S3.headObject(objectRequest);
        }
    }

    private LazyHolder lazyHolder = new LazyHolder();

    private final String bucket;
    private final String file;
    private final byte[] buffer;
    private long lastByteOffset;

    private long offset = 0;
    private int next = 0;
    private int length = 0;

    public S3InputStreamV2(final String bucket, final String file, final byte[] buffer, String appID, String secret) {
        this.bucket = bucket;
        this.file = file;
        this.buffer = buffer;
        lazyHolder.appID = appID;
        lazyHolder.secretKey = secret;
        lazyHolder.connect();
        this.lastByteOffset = lazyHolder.getHead(file, bucket).contentLength();
    }

    @Override
    public int read() throws IOException {
        if (next >= length || (next == buffer.length && length == buffer.length)) {
            fill();

            if (length <= 0) {
                return -1;
            }

            next = 0;
        }

        if (next >= length) {
            return -1;
        }

        return buffer[this.next++] & 0xFF;
    }

    public void fill() throws IOException {
        if (offset >= lastByteOffset) {
            length = -1;
        } else {
            try (final InputStream inputStream = s3Object()) {
                length = 0;
                int b;

                while ((b = inputStream.read()) != -1) {
                    buffer[length++] = (byte) b;
                }

                if (length > 0) {
                    offset += length;
                }
            }
        }
    }

    private InputStream s3Object() {
        final Long rangeEnd = offset + buffer.length - 1;
        final String rangeString = "bytes=" + offset + "-" + rangeEnd;
        final GetObjectRequest getObjectRequest = GetObjectRequest.builder().bucket(bucket).key(file).range(rangeString)
                .build();

        return lazyHolder.S3.getObject(getObjectRequest);
    }
}

You can read all the files in the bucket by following the continuation tokens, and you can parse the files with other Java libraries, e.g. PDFBox for PDFs:

import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
// ..
// in your main class
private static AWSCredentials credentials = null;
private static AmazonS3 amazonS3Client = null;

public static void initializeAmazonObjects() {
    credentials = new BasicAWSCredentials(ACCESS_KEY, SECRET_ACCESS_KEY);
    amazonS3Client = new AmazonS3Client(credentials);
}
public void mainMethod() throws IOException, AmazonS3Exception {
    // connect to AWS
    initializeAmazonObjects();

    ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName);
    ListObjectsV2Result listObjectsResult;

    do {
        listObjectsResult = amazonS3Client.listObjectsV2(req);

        for (S3ObjectSummary objectSummary : listObjectsResult.getObjectSummaries()) {
            System.out.printf(" - %s (size: %d)\n", objectSummary.getKey(), objectSummary.getSize());

            String key = objectSummary.getKey();

            // only try to read PDF files
            if (!key.endsWith(".pdf")) {
                continue;
            }

            // read the source file as text
            String pdfFileInText = readAwsFile(objectSummary.getBucketName(), key);
            if (pdfFileInText.isEmpty()) {
                continue;
            }
        } // end of current batch

        // If there are more than maxKeys (1,000 by default) keys in the bucket,
        // get a continuation token and list the next objects.
        String token = listObjectsResult.getNextContinuationToken();
        System.out.println("Next Continuation Token: " + token);
        req.setContinuationToken(token);
    } while (listObjectsResult.isTruncated());
}

public String readAwsFile(String bucketName, String keyName) {
    String pdfFileInText = "";

    try {
        S3Object object = amazonS3Client.getObject(new GetObjectRequest(bucketName, keyName));
        InputStream objectData = object.getObjectContent();

        PDDocument document = PDDocument.load(objectData);

        if (!document.isEncrypted()) {
            PDFTextStripper tStripper = new PDFTextStripper();
            pdfFileInText = tStripper.getText(document);
        }

        document.close();
        objectData.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

    return pdfFileInText;
}

We got puzzled while migrating from the AWS SDK v1 to v2, and realized that in the v2 SDK the range is not defined the same way.

With the AWS v1 SDK:

    S3Object currentS3Obj = client.getObject(new GetObjectRequest(bucket, key).withRange(start, end));
    return currentS3Obj.getObjectContent();

With the AWS v2 SDK:

    var range = String.format("bytes=%d-%d", start, end);
    ResponseBytes<GetObjectResponse> currentS3Obj = client.getObjectAsBytes(
            GetObjectRequest.builder().bucket(bucket).key(key).range(range).build());
    return currentS3Obj.asInputStream();
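Note that the v2 range string must carry the "bytes=" prefix, and both ends of the range are inclusive. A quick sketch (a hypothetical helper, not an SDK API) that partitions an object of known length into non-overlapping range strings:

```java
import java.util.ArrayList;
import java.util.List;

public class RangePartitioner {
    /** Inclusive byte ranges in the "bytes=start-end" form the v2 SDK expects. */
    public static List<String> ranges(long contentLength, long chunkSize) {
        List<String> out = new ArrayList<>();
        for (long start = 0; start < contentLength; start += chunkSize) {
            // end is inclusive, so the last chunk stops at contentLength - 1
            long end = Math.min(start + chunkSize - 1, contentLength - 1);
            out.add(String.format("bytes=%d-%d", start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // a 100-byte object split into 40-byte chunks
        System.out.println(ranges(100, 40)); // [bytes=0-39, bytes=40-79, bytes=80-99]
    }
}
```

Each range string can then be passed to GetObjectRequest.builder()...range(...) as shown above.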
