How to read file chunk by chunk from S3 using aws-java-sdk
I am trying to read a large file from S3 in chunks, without cutting any line, so I can process the chunks in parallel.
Let me explain with an example: there is a 1 GB file on S3, and I want to split it into 64 MB chunks. I can do that easily enough:
S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
InputStream stream = s3object.getObjectContent();
byte[] content = new byte[64*1024*1024];
int bytesRead;
while ((bytesRead = stream.read(content)) != -1) {
    // process content[0..bytesRead) here; note read() may return fewer bytes than the buffer size
}
But the problem with a chunk is that it may contain 100 complete lines and one incomplete line. I cannot process an incomplete line, and I don't want to discard it either.
Is there any way to handle this situation, i.e. so that every chunk contains only whole lines?
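One common way to get chunks that end on line boundaries is to read fixed-size blocks, cut each block at its last newline, and carry the partial tail line over into the next chunk. A minimal sketch of that idea (the class and method names are hypothetical, and UTF-8 line data is assumed):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineChunker {
    /**
     * Reads the stream in fixed-size blocks, trims each block at its last
     * newline, and carries the partial tail line into the next chunk, so
     * every emitted chunk contains only whole lines.
     */
    public static List<String> chunksOfWholeLines(InputStream in, int chunkSize) throws IOException {
        List<String> chunks = new ArrayList<>();
        byte[] buf = new byte[chunkSize];
        ByteArrayOutputStream carry = new ByteArrayOutputStream();
        int n;
        while ((n = in.read(buf)) != -1) {
            // Find the last newline in this block.
            int cut = -1;
            for (int i = n - 1; i >= 0; i--) {
                if (buf[i] == '\n') {
                    cut = i;
                    break;
                }
            }
            if (cut == -1) {
                carry.write(buf, 0, n); // no newline at all: keep accumulating
            } else {
                carry.write(buf, 0, cut + 1); // complete lines, including the newline
                chunks.add(carry.toString(StandardCharsets.UTF_8.name()));
                carry.reset();
                carry.write(buf, cut + 1, n - cut - 1); // partial tail line starts the next chunk
            }
        }
        if (carry.size() > 0) { // trailing data without a final newline
            chunks.add(carry.toString(StandardCharsets.UTF_8.name()));
        }
        return chunks;
    }
}
```

Each emitted chunk is roughly chunkSize bytes (a chunk can exceed it only when a single line is longer than chunkSize), and no line is ever split across chunks.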
My usual approach (InputStream -> BufferedReader.lines() -> batches of lines -> CompletableFuture) won't work here, because the underlying S3ObjectInputStream eventually times out on huge files.
So I created a new class, S3InputStream, which doesn't care how long it stays open, and reads byte blocks on demand using short-lived AWS SDK calls. You provide a byte[] that will be reused. new byte[1 << 24] (16 MB) seems to work well.
package org.harrison;

import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

/**
 * An {@link InputStream} for S3 files that does not care how big the file is.
 *
 * @author stephen harrison
 */
public class S3InputStream extends InputStream {
    private static class LazyHolder {
        private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
    }

    private final String bucket;
    private final String file;
    private final byte[] buffer;
    private long lastByteOffset; // one past the last readable byte

    private long offset = 0;
    private int next = 0;
    private int length = 0;

    public S3InputStream(final String bucket, final String file, final byte[] buffer) {
        this.bucket = bucket;
        this.file = file;
        this.buffer = buffer;
        this.lastByteOffset = LazyHolder.S3.getObjectMetadata(bucket, file).getContentLength();
    }

    @Override
    public int read() throws IOException {
        if (next >= length) {
            fill();

            if (length <= 0) {
                return -1;
            }

            next = 0;
        }

        if (next >= length) {
            return -1;
        }

        // Mask to 0..255 so bytes >= 0x80 are not mistaken for end-of-stream.
        return buffer[this.next++] & 0xFF;
    }

    public void fill() throws IOException {
        if (offset >= lastByteOffset) {
            length = -1;
        } else {
            try (final InputStream inputStream = s3Object()) {
                length = 0;
                int bytesRead;

                // Copy the ranged GET into the reusable buffer in bulk.
                while (length < buffer.length
                        && (bytesRead = inputStream.read(buffer, length, buffer.length - length)) != -1) {
                    length += bytesRead;
                }

                if (length > 0) {
                    offset += length;
                }
            }
        }
    }

    private InputStream s3Object() {
        final GetObjectRequest request = new GetObjectRequest(bucket, file).withRange(offset,
                offset + buffer.length - 1);

        return LazyHolder.S3.getObject(request).getObjectContent();
    }
}
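Usage is then the same as for any other InputStream. A sketch of the intended pipeline, where a ByteArrayInputStream stands in for new S3InputStream("my-bucket", "big-file.csv", new byte[1 << 24]) so the snippet runs without AWS credentials (the bucket and key names are placeholders):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class S3InputStreamUsage {
    /**
     * Same pipeline you would use with the real stream:
     *   readLines(new S3InputStream("my-bucket", "big-file.csv", new byte[1 << 24]))
     */
    public static List<String> readLines(InputStream in) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```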
The aws-java-sdk already provides streaming for your S3 objects. You have to call getObject, and the result is an InputStream.
1) AmazonS3Client.getObject(GetObjectRequest getObjectRequest) -> S3Object
2) S3Object.getObjectContent()
Note: the method is a simple getter and does not actually create a stream. If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents are not buffered in memory and stream directly from Amazon S3. Further, failure to close this stream can cause the request pool to become blocked.
100 complete lines and one incomplete line
Do you mean that you need to read the stream line by line? If so, instead of using an InputStream, try reading the S3 object stream with a BufferedReader so you can read the stream line by line, though I think it will be a little slower than chunk by chunk.
S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
BufferedReader in = new BufferedReader(new InputStreamReader(s3object.getObjectContent()));
String line;
while ((line = in.readLine()) != null) {
//process line here
}
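From there the lines can be grouped into batches for the kind of parallel processing the question describes. A sketch (the batch size is arbitrary, and any Reader works in place of the S3 stream):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class LineBatcher {
    /** Groups lines from the reader into batches of at most batchSize lines. */
    public static List<List<String>> batches(Reader source, int batchSize) throws IOException {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                current.add(line);
                if (current.size() == batchSize) {
                    // A full batch could be handed off here, e.g.
                    // CompletableFuture.runAsync(() -> process(batch), executor);
                    batches.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // final short batch
        }
        return batches;
    }
}
```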
The answer by @stephen-harrison works well. I updated it for v2 of the SDK. I made a couple of tweaks: mainly, the connection can now be authorized, and the LazyHolder class is no longer static (I could not figure out how to authorize the connection and still keep the class static).
For another approach using Scala, see https://alexwlchan.net/2019/09/streaming-large-s3-objects/
package foo.whatever;

import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

/**
 * Adapted for aws Java sdk v2 by jomofrodo@gmail.com
 *
 * An {@link InputStream} for S3 files that does not care how big the file is.
 *
 * @author stephen harrison
 */
public class S3InputStreamV2 extends InputStream {
    private class LazyHolder {
        String appID;
        String secretKey;
        Region region = Region.US_WEST_1;
        public S3Client S3 = null;

        public void connect() {
            AwsBasicCredentials awsCreds = AwsBasicCredentials.create(appID, secretKey);
            S3 = S3Client.builder().region(region).credentialsProvider(StaticCredentialsProvider.create(awsCreds))
                    .build();
        }

        private HeadObjectResponse getHead(String keyName, String bucketName) {
            HeadObjectRequest objectRequest = HeadObjectRequest.builder().key(keyName).bucket(bucketName).build();

            return S3.headObject(objectRequest);
        }
    }

    private LazyHolder lazyHolder = new LazyHolder();

    private final String bucket;
    private final String file;
    private final byte[] buffer;
    private long lastByteOffset; // one past the last readable byte

    private long offset = 0;
    private int next = 0;
    private int length = 0;

    public S3InputStreamV2(final String bucket, final String file, final byte[] buffer, String appID, String secret) {
        this.bucket = bucket;
        this.file = file;
        this.buffer = buffer;
        lazyHolder.appID = appID;
        lazyHolder.secretKey = secret;
        lazyHolder.connect();
        this.lastByteOffset = lazyHolder.getHead(file, bucket).contentLength();
    }

    @Override
    public int read() throws IOException {
        if (next >= length) {
            fill();

            if (length <= 0) {
                return -1;
            }

            next = 0;
        }

        if (next >= length) {
            return -1;
        }

        // Mask to 0..255 so bytes >= 0x80 are not mistaken for end-of-stream.
        return buffer[this.next++] & 0xFF;
    }

    public void fill() throws IOException {
        if (offset >= lastByteOffset) {
            length = -1;
        } else {
            try (final InputStream inputStream = s3Object()) {
                length = 0;
                int bytesRead;

                // Copy the ranged GET into the reusable buffer in bulk.
                while (length < buffer.length
                        && (bytesRead = inputStream.read(buffer, length, buffer.length - length)) != -1) {
                    length += bytesRead;
                }

                if (length > 0) {
                    offset += length;
                }
            }
        }
    }

    private InputStream s3Object() {
        final long rangeEnd = offset + buffer.length - 1;
        final String rangeString = "bytes=" + offset + "-" + rangeEnd;
        final GetObjectRequest getObjectRequest = GetObjectRequest.builder().bucket(bucket).key(file).range(rangeString)
                .build();

        return lazyHolder.S3.getObject(getObjectRequest);
    }
}
You can read all the files in the bucket by checking the continuation token, and you can read the file contents with other Java libraries, e.g. PDFBox for PDF files.
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;

//..
// in your main class
private static AWSCredentials credentials = null;
private static AmazonS3 amazonS3Client = null;

public static void intializeAmazonObjects() {
    credentials = new BasicAWSCredentials(ACCESS_KEY, SECRET_ACCESS_KEY);
    amazonS3Client = new AmazonS3Client(credentials);
}

public void mainMethod() throws IOException, AmazonS3Exception {
    // connect to aws
    intializeAmazonObjects();

    ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName);
    ListObjectsV2Result listObjectsResult;
    do {
        listObjectsResult = amazonS3Client.listObjectsV2(req);
        for (S3ObjectSummary objectSummary : listObjectsResult.getObjectSummaries()) {
            System.out.printf(" - %s (size: %d)\n", objectSummary.getKey(), objectSummary.getSize());

            String key = objectSummary.getKey();

            // only try to read pdf files
            if (!key.contains(".pdf")) {
                continue;
            }

            // Read the source file as text
            String pdfFileInText = readAwsFile(objectSummary.getBucketName(), key);
            if (pdfFileInText.isEmpty())
                continue;
        } // end of current bulk

        // If there are more than maxKeys (1,000 by default) keys in the bucket,
        // get a continuation token and list the next objects.
        String token = listObjectsResult.getNextContinuationToken();
        System.out.println("Next Continuation Token: " + token);
        req.setContinuationToken(token);
    } while (listObjectsResult.isTruncated());
}

public String readAwsFile(String bucketName, String keyName) {
    String pdfFileInText = "";
    try {
        S3Object object = amazonS3Client.getObject(new GetObjectRequest(bucketName, keyName));
        InputStream objectData = object.getObjectContent();

        // Closing the document also releases the underlying S3 stream.
        try (PDDocument document = PDDocument.load(objectData)) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                tStripper.setSortByPosition(true);
                pdfFileInText = tStripper.getText(document);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

    return pdfFileInText;
}
We got confused when migrating from the AWS SDK v1 to v2, and realized that the v2 SDK defines the byte range differently.
Using the AWS v1 SDK:
S3Object currentS3Obj = client.getObject(new GetObjectRequest(bucket, key).withRange(start, end));
return currentS3Obj.getObjectContent();
Using the AWS v2 SDK:
var range = String.format("bytes=%d-%d", start, end);
ResponseBytes<GetObjectResponse> currentS3Obj = client.getObjectAsBytes(GetObjectRequest.builder().bucket(bucket).key(key).range(range).build());
return currentS3Obj.asInputStream();
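The v2 range parameter is just the raw HTTP Range header value, inclusive on both ends, which v1's withRange(start, end) built internally. A minimal check of the format (the start and end values are illustrative):

```java
public class RangeFormat {
    // v2 expects the raw HTTP Range header value; both ends are inclusive.
    public static String rangeHeader(long start, long end) {
        return String.format("bytes=%d-%d", start, end);
    }
}
```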