Amazon S3 returns only 1000 entries for one bucket and all entries for another bucket (using the Java SDK)?
I am using the code below to get a list of all file names in an S3 bucket. I have two buckets in S3. For one of them the code returns all the file names (more than 1000), but for the other the same code returns only 1000 file names. I don't understand what is happening. Why does the same code work for one bucket and not for the other?
My buckets also have a hierarchical structure: folder/filename.jpg.
ObjectListing objects = s3.listObjects("bucket.new.test");
do {
for (S3ObjectSummary objectSummary : objects.getObjectSummaries()) {
String key = objectSummary.getKey();
System.out.println(key);
}
objects = s3.listNextBatchOfObjects(objects);
} while (objects.isTruncated());
Improving on @Abhishek's answer: this code is slightly shorter and the variable names are fixed.
You have to get the object listing, add its contents to the collection, then get the next batch of objects from the listing. Repeat the operation until the listing is no longer truncated.
List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing objects = s3.listObjects("bucket.new.test");
keyList.addAll(objects.getObjectSummaries());
while (objects.isTruncated()) {
objects = s3.listNextBatchOfObjects(objects);
keyList.addAll(objects.getObjectSummaries());
}
For Scala developers, here is a recursive function that performs a full scan and map of the contents of an Amazon S3 bucket using the official AWS SDK for Java:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{S3ObjectSummary, ObjectListing, GetObjectRequest}
import scala.collection.JavaConversions.{collectionAsScalaIterable => asScala}
def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T) = {
  def scan(acc: List[T], listing: ObjectListing): List[T] = {
    val summaries = asScala[S3ObjectSummary](listing.getObjectSummaries())
    val mapped = (for (summary <- summaries) yield f(summary)).toList
    // On the final page, return the accumulated results together with this page
    if (!listing.isTruncated) acc ::: mapped
    else scan(acc ::: mapped, s3.listNextBatchOfObjects(listing))
  }
  scan(List(), s3.listObjects(bucket, prefix))
}
To invoke the curried map() function above, pass the already constructed (and properly initialized) AmazonS3Client object (refer to the official AWS SDK for Java API Reference), the bucket name, and the prefix in the first parameter list. In the second parameter list, pass the function f() that you want to apply to each object summary.
For example,
val keyOwnerTuples = map(s3, bucket, prefix)(s => (s.getKey, s.getOwner))
will return the full list of (key, owner) tuples in that bucket/prefix, or
map(s3, "bucket", "prefix")(s => println(s))
as you would normally approach it with monads in functional programming.
I just changed the code above to use addAll instead of a for loop that adds objects one by one, and it worked for me:
List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing object = s3.listObjects("bucket.new.test");
// Copy the first page into keyList instead of aliasing the listing's own list
keyList.addAll(object.getObjectSummaries());
object = s3.listNextBatchOfObjects(object);
while (object.isTruncated()) {
    keyList.addAll(object.getObjectSummaries());
    object = s3.listNextBatchOfObjects(object);
}
// Add the final (non-truncated) page
keyList.addAll(object.getObjectSummaries());
After that you can simply iterate over the list keyList.
If you want to get all objects (more than 1000 keys), you need to send another request to S3 with the last key you received as the marker. Here is the code.
private static String lastKey = "";
private static String preLastKey = "";
...
AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());
String bucketName = "bucketname";
do {
    preLastKey = lastKey;
    ListObjectsRequest lstRQ = new ListObjectsRequest().withBucketName(bucketName).withPrefix("");
    lstRQ.setMarker(lastKey);
    ObjectListing objectListing = s3.listObjects(lstRQ);
    // Loop over the objects returned in this page
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        // get the object and do something...
        lastKey = objectSummary.getKey(); // remember the last key; it is the marker for the next request
    }
} while (!lastKey.equals(preLastKey)); // stop once a request returns no new keys
In Scala:
val first = s3.listObjects("bucket.new.test")
val listings: Seq[ObjectListing] = Iterator.iterate(Option(first))(_.flatMap(listing =>
if (listing.isTruncated) Some(s3.listNextBatchOfObjects(listing))
else None
))
.takeWhile(_.nonEmpty)
.toList
.flatten
An alternative using a recursive method:
/**
* A recursive method to wrap {@link AmazonS3} listObjectsV2 method.
* <p>
* By default, ListObjectsV2 can only return some or all (UP TO 1,000) of the objects in a bucket per request.
* Ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
* <p>
* However, this method can return unlimited {@link S3ObjectSummary} for each request.
*
* @param request
* @return
*/
private List<S3ObjectSummary> getS3ObjectSummaries(final ListObjectsV2Request request) {
final ListObjectsV2Result result = s3Client.listObjectsV2(request);
final List<S3ObjectSummary> resultSummaries = result.getObjectSummaries();
if (result.isTruncated() && isNotBlank(result.getNextContinuationToken())) {
final ListObjectsV2Request nextRequest = request.withContinuationToken(result.getNextContinuationToken());
final List<S3ObjectSummary> nextResultSummaries = this.getS3ObjectSummaries(nextRequest);
resultSummaries.addAll(nextResultSummaries);
}
return resultSummaries;
}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary}
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T): List[T] = {
def scan(acc: ListBuffer[T], listing: ObjectListing): List[T] = {
val r = acc ++= listing.getObjectSummaries.asScala.map(f).toList
if (listing.isTruncated) scan(r, s3.listNextBatchOfObjects(listing))
else r.toList
}
scan(ListBuffer.empty[T], s3.listObjects(bucket, prefix))
}
The second method is to use AWS SDK v2:
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>s3</artifactId>
<version>2.1.0</version>
</dependency>
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, S3Object}
import scala.collection.JavaConverters._
def listObjects[T](s3: S3Client, bucket: String,
prefix: String, startAfter: String)(f: (S3Object) => T): List[T] = {
val request = ListObjectsV2Request.builder()
.bucket(bucket).prefix(prefix)
.startAfter(startAfter).build()
s3.listObjectsV2Paginator(request)
.asScala
.flatMap(_.contents().asScala)
.map(f)
.toList
}
By default the API returns up to 1,000 key names. The response might contain fewer keys but will never contain more. A better implementation would be to use the newer ListObjectsV2 API:
List<S3ObjectSummary> docList=new ArrayList<>();
ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName).withPrefix(folderFullPath);
ListObjectsV2Result listing;
do{
listing=this.getAmazonS3Client().listObjectsV2(req);
docList.addAll(listing.getObjectSummaries());
String token = listing.getNextContinuationToken();
req.setContinuationToken(token);
LOG.info("Next Continuation Token for listing documents is :"+token);
}while (listing.isTruncated());
The code given by @oferei works well and I upvoted it. But I want to point out the root issue with @Abhishek's code. The problem is in the do-while loop.
If you look carefully, you fetch the next batch of objects in the second-to-last statement and only then check whether you have exhausted the list of files. So when you fetch the last batch, isTruncated() becomes false, you break out of the loop, and the last X % 1000 records are never processed. For example, if you had 2123 records in total, you would end up processing 1000 and then another 1000, i.e. 2000 records. You would miss the remaining 123 records because the isTruncated check ends the loop before that final batch is processed.
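The effect described above can be reproduced without touching S3. The following is only a sketch under stated assumptions: Page and fetch are hypothetical stand-ins for ObjectListing and listObjects/listNextBatchOfObjects, not SDK types, and the page size of 1000 is passed in explicitly. It compares the loop shape from the question with the corrected shape.

```java
import java.util.ArrayList;
import java.util.List;

class TruncationDemo {
    // Hypothetical stand-in for ObjectListing: one page of records plus a truncation flag.
    static final class Page {
        final List<Integer> items;
        final boolean truncated;
        final int nextStart;

        Page(List<Integer> items, boolean truncated, int nextStart) {
            this.items = items;
            this.truncated = truncated;
            this.nextStart = nextStart;
        }
    }

    // Simulates fetching one page of at most pageSize records out of total records.
    static Page fetch(int start, int total, int pageSize) {
        int end = Math.min(start + pageSize, total);
        List<Integer> items = new ArrayList<>();
        for (int i = start; i < end; i++) {
            items.add(i);
        }
        return new Page(items, end < total, end);
    }

    // Loop shape from the question: fetch the next page, then test the truncation flag.
    static int processedByDoWhile(int total, int pageSize) {
        int processed = 0;
        Page page = fetch(0, total, pageSize);
        do {
            processed += page.items.size();
            page = fetch(page.nextStart, total, pageSize);
        } while (page.truncated); // the last fetched page is never processed
        return processed;
    }

    // Corrected shape: process the current page, then fetch another only if truncated.
    static int processedByWhile(int total, int pageSize) {
        Page page = fetch(0, total, pageSize);
        int processed = page.items.size();
        while (page.truncated) {
            page = fetch(page.nextStart, total, pageSize);
            processed += page.items.size();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processedByDoWhile(2123, 1000)); // 2000
        System.out.println(processedByWhile(2123, 1000));   // 2123
    }
}
```

In this simulation a source with 500 records is listed completely by both loops, while one with 1500 records yields only 1000 from the do-while version, which is consistent with one bucket appearing complete and another stopping at 1000.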
Apologies, I can't post a comment; otherwise I would have commented on the upvoted answer.
The reason you are getting only the first 1000 objects is that this is how listObjects is designed to work.
This is from its JavaDoc:
Returns some or all (up to 1,000) of the objects in a bucket with each request.
You can use the request parameters as selection criteria to return a subset of the objects in a bucket.
A 200 OK response can contain valid or invalid XML. Make sure to design your application to parse the contents of the response and handle it appropriately.
Objects are returned sorted in an ascending order of the respective key names in the list. For more information about listing objects, see Listing object keys programmatically
To get paginated results automatically, use the listObjectsV2Paginator method:
ListObjectsV2Request listReq = ListObjectsV2Request.builder()
.bucket(bucketName)
.maxKeys(1)
.build();
ListObjectsV2Iterable listRes = s3.listObjectsV2Paginator(listReq);
// Helper method to work with paginated collection of items directly
listRes.contents().stream()
.forEach(content -> System.out.println(" Key: " + content.key() + " size = " + content.size()));
You can opt for manual pagination as well if needed.
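Manual pagination generally means passing the token from one response into the next request until no token comes back. Here is a minimal sketch of that loop, assuming a hypothetical Page type and an in-memory fakeService in place of S3 (neither is an SDK class). With the real SDK v2 client, the token would presumably come from ListObjectsV2Response.nextContinuationToken() and be set on the next request through the builder's continuationToken(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class ManualPagination {
    // Hypothetical page type: a batch of keys plus the token for the next page (null when done).
    static final class Page {
        final List<String> keys;
        final String nextToken;

        Page(List<String> keys, String nextToken) {
            this.keys = keys;
            this.nextToken = nextToken;
        }
    }

    // Calls fetchPage repeatedly, feeding back the previous token, until no token is returned.
    static List<String> listAll(Function<String, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        String token = null; // null requests the first page
        do {
            Page page = fetchPage.apply(token);
            all.addAll(page.keys); // process the page before deciding whether to continue
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    // In-memory stand-in for a paginated service; the token is simply the next start index.
    static Function<String, Page> fakeService(List<String> keys, int pageSize) {
        return token -> {
            int start = (token == null) ? 0 : Integer.parseInt(token);
            int end = Math.min(start + pageSize, keys.size());
            String next = (end < keys.size()) ? String.valueOf(end) : null;
            return new Page(new ArrayList<>(keys.subList(start, end)), next);
        };
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg");
        System.out.println(listAll(fakeService(keys, 2)));
    }
}
```

Each page is added to the result before the token is checked, so the final, non-truncated page is not dropped.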
Reference: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/pagination.html