
Amazon S3 returns only 1000 entries for one bucket and all for another bucket (using Java SDK)?

I am using the code below to get a list of all file names from an S3 bucket. I have two buckets in S3. For one of the buckets the code returns all the file names (more than 1000), but the same code returns only 1000 file names for the other bucket. I just don't get what is happening. Why does the same code work for one bucket and not the other?

Also, my bucket has a hierarchical structure: folder/filename.jpg.

ObjectListing objects = s3.listObjects("bucket.new.test");
do {
    for (S3ObjectSummary objectSummary : objects.getObjectSummaries()) {
        String key = objectSummary.getKey();
        System.out.println(key);
    }
    objects = s3.listNextBatchOfObjects(objects);
} while (objects.isTruncated());

Improving on @Abhishek's answer. This code is slightly shorter and the variable names are fixed.

You have to get the object listing, add its contents to the collection, then get the next batch of objects from the listing. Repeat the operation until the listing is no longer truncated.

List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing objects = s3.listObjects("bucket.new.test");
keyList.addAll(objects.getObjectSummaries());

while (objects.isTruncated()) {
    objects = s3.listNextBatchOfObjects(objects);
    keyList.addAll(objects.getObjectSummaries());
}

For Scala developers, here is a recursive function to execute a full scan and map of the contents of an Amazon S3 bucket using the official AWS SDK for Java:

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{S3ObjectSummary, ObjectListing, GetObjectRequest}
import scala.collection.JavaConversions.{collectionAsScalaIterable => asScala}

def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T) = {

  def scan(acc:List[T], listing:ObjectListing): List[T] = {
    val summaries = asScala[S3ObjectSummary](listing.getObjectSummaries())
    val mapped = (for (summary <- summaries) yield f(summary)).toList

    if (!listing.isTruncated) mapped.toList
    else scan(acc ::: mapped, s3.listNextBatchOfObjects(listing))
  }

  scan(List(), s3.listObjects(bucket, prefix))
}

To invoke the above curried map() function, simply pass the already constructed (and properly initialized) AmazonS3Client object (refer to the official AWS SDK for Java API Reference), the bucket name, and the prefix name in the first parameter list. Also pass the function f() you want to apply to map each object summary in the second parameter list.

For example,

val keyOwnerTuples = map(s3, bucket, prefix)(s => (s.getKey, s.getOwner))

will return the full list of (key, owner) tuples in that bucket/prefix,

or

map(s3, "bucket", "prefix")(s => println(s))

much as you would normally approach it with monads in functional programming.

I have just changed the above code to use addAll instead of a for loop to add objects one by one, and it worked for me:

List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing object = s3.listObjects("bucket.new.test");
keyList.addAll(object.getObjectSummaries());

while (object.isTruncated()) {
  object = s3.listNextBatchOfObjects(object);
  keyList.addAll(object.getObjectSummaries());
}

After that you can simply use any iterator over the list keyList.

If you want to get all objects (more than 1000 keys), you need to send another request with the last key to S3. Here is the code.

private static String lastKey = "";
private static String preLastKey = "";
...

AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());
String bucketName = "bucketname";

do {
        preLastKey = lastKey;

        ListObjectsRequest lstRQ = new ListObjectsRequest().withBucketName(bucketName).withPrefix("");
        lstRQ.setMarker(lastKey);

        ObjectListing objectListing = s3.listObjects(lstRQ);

        // loop over the files in this batch, remembering the last key seen
        // so it can be used as the marker for the next request
        for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
             // get object and do something...
             lastKey = objectSummary.getKey();
        }
} while (!lastKey.equals(preLastKey));

In Scala:

val first = s3.listObjects("bucket.new.test")

val listings: Seq[ObjectListing] = Iterator.iterate(Option(first))(_.flatMap(listing =>
  if (listing.isTruncated) Some(s3.listNextBatchOfObjects(listing))
  else None
))
  .takeWhile(_.nonEmpty)
  .toList
  .flatten

An alternative way, using a recursive method:

/**
 * A recursive method to wrap {@link AmazonS3} listObjectsV2 method.
 * <p>
 * By default, ListObjectsV2 can only return some or all (UP TO 1,000) of the objects in a bucket per request.
 * Ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
 * <p>
 * However, this method follows continuation tokens and returns the summaries
 * for all matching objects, not just a single page.
 *
 * @param request the initial {@link ListObjectsV2Request}
 * @return all matching {@link S3ObjectSummary} entries
 */
private List<S3ObjectSummary> getS3ObjectSummaries(final ListObjectsV2Request request) {
    final ListObjectsV2Result result = s3Client.listObjectsV2(request);
    final List<S3ObjectSummary> resultSummaries = result.getObjectSummaries();
    if (result.isTruncated() && isNotBlank(result.getNextContinuationToken())) {
        final ListObjectsV2Request nextRequest = request.withContinuationToken(result.getNextContinuationToken());
        final List<S3ObjectSummary> nextResultSummaries = this.getS3ObjectSummaries(nextRequest);
        resultSummaries.addAll(nextResultSummaries);
    }
    return resultSummaries;
}
  1. Paolo Angioletti's code can't get all the data, only the last batch of data.
  2. I think it might be better to use ListBuffer.
  3. This method does not support setting startAfterKey.
    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary}    
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ListBuffer

    def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T): List[T] = {

      def scan(acc: ListBuffer[T], listing: ObjectListing): List[T] = {
        val r = acc ++= listing.getObjectSummaries.asScala.map(f).toList
        if (listing.isTruncated) scan(r, s3.listNextBatchOfObjects(listing))
        else r.toList
      }

      scan(ListBuffer.empty[T], s3.listObjects(bucket, prefix))
    }

The second method is to use awssdk-v2:

<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>s3</artifactId>
    <version>2.1.0</version>
</dependency>
  import software.amazon.awssdk.services.s3.S3Client
  import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, S3Object}

  import scala.collection.JavaConverters._

  def listObjects[T](s3: S3Client, bucket: String,
                     prefix: String, startAfter: String)(f: (S3Object) => T): List[T] = {
    val request = ListObjectsV2Request.builder()
      .bucket(bucket).prefix(prefix)
      .startAfter(startAfter).build()

    s3.listObjectsV2Paginator(request)
      .asScala
      .flatMap(_.contents().asScala)
      .map(f)
      .toList
  }

By default the API returns up to 1,000 key names. The response might contain fewer keys but will never contain more. A better implementation would be to use the newer ListObjectsV2 API:

List<S3ObjectSummary> docList = new ArrayList<>();
ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName).withPrefix(folderFullPath);
ListObjectsV2Result listing;
do {
    listing = this.getAmazonS3Client().listObjectsV2(req);
    docList.addAll(listing.getObjectSummaries());
    String token = listing.getNextContinuationToken();
    req.setContinuationToken(token);
    LOG.info("Next continuation token for listing documents is: " + token);
} while (listing.isTruncated());

The code given by @oferei works well and I upvote it. But I want to point out the root issue with @Abhishek's code. Actually, the problem is with your do-while loop.

If you observe carefully, you fetch the next batch of objects in the second-to-last statement and only then check whether you have exhausted the total list of files. So when you fetch the last batch, isTruncated() becomes false, you break out of the loop, and you never process the last X % 1000 records. For example: if you had 2123 records in total, you would end up fetching 1000 and then 1000, i.e. 2000 records. You would miss the remaining 123 records, because the isTruncated check runs on the freshly fetched batch before that batch is ever processed.
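A fixed version of that loop processes the current batch first and only then checks isTruncated() before fetching the next one. The sketch below is runnable in isolation: FakeListing is a stand-in for the SDK's ObjectListing (in real code you would call s3.listNextBatchOfObjects(listing) instead of listing.nextBatch()), simulating 2123 keys delivered in batches of up to 1000.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedLoopDemo {

    // Stand-in for com.amazonaws.services.s3.model.ObjectListing,
    // used here only so the control flow can run without an S3 connection.
    static class FakeListing {
        final List<String> keys = new ArrayList<>();
        final int nextStart;
        final int total;

        FakeListing(int start, int total) {
            this.total = total;
            this.nextStart = Math.min(start + 1000, total);
            for (int i = start; i < nextStart; i++) keys.add("key-" + i);
        }

        boolean isTruncated() { return nextStart < total; }

        // Stand-in for s3.listNextBatchOfObjects(listing)
        FakeListing nextBatch() { return new FakeListing(nextStart, total); }
    }

    public static void main(String[] args) {
        FakeListing listing = new FakeListing(0, 2123);
        int processed = 0;
        while (true) {
            // 1. Process the current batch FIRST ...
            for (String key : listing.keys) processed++;
            // 2. ... then decide whether there is more to fetch.
            if (!listing.isTruncated()) break;
            listing = listing.nextBatch();
        }
        System.out.println(processed); // the final partial batch is included
    }
}
```

With this ordering, the final partial batch (the 123 records in the example) is processed before the loop exits.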

Apologies, I can't post a comment, else I would have commented on the upvoted answer.

The reason you are getting only the first 1000 objects is that this is how listObjects is designed to work.

This is from its JavaDoc:

Returns some or all (up to 1,000) of the objects in a bucket with each request. 
You can use the request parameters as selection criteria to return a subset of the objects in a bucket. 
A 200 OK response can contain valid or invalid XML. Make sure to design your application to parse the contents of the response and handle it appropriately. 
Objects are returned sorted in an ascending order of the respective key names in the list. For more information about listing objects, see Listing object keys programmatically 

To get paginated results automatically, use the listObjectsV2Paginator method:

ListObjectsV2Request listReq = ListObjectsV2Request.builder()
        .bucket(bucketName)
        .maxKeys(1)
        .build();

ListObjectsV2Iterable listRes = s3.listObjectsV2Paginator(listReq);

// Helper method to work with the paginated collection of items directly
listRes.contents().stream()
        .forEach(content -> System.out.println(" Key: " + content.key() + " size = " + content.size()));

You can opt for manual pagination as well if needed.
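The manual pagination loop has the following shape. This sketch runs standalone against a tiny in-memory stub: PagedStub and its page size are illustrative stand-ins for the S3 client, while the nextToken/isTruncated() flow mirrors how a ListObjectsV2 response's continuation token is fed back into the next request.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualPaginationDemo {

    // Stand-in for one page of results (cf. a ListObjectsV2 response).
    static class Page {
        final List<String> keys = new ArrayList<>();
        String nextToken;                        // null on the last page
        boolean isTruncated() { return nextToken != null; }
    }

    // Stand-in for the S3 client; real code would build a request with the
    // previous continuation token and call the SDK's listObjectsV2 instead.
    static class PagedStub {
        final int total, pageSize;
        PagedStub(int total, int pageSize) { this.total = total; this.pageSize = pageSize; }

        Page list(String token) {
            // The token encodes where to resume, like S3's opaque continuation token.
            int start = (token == null) ? 0 : Integer.parseInt(token);
            int end = Math.min(start + pageSize, total);
            Page p = new Page();
            for (int i = start; i < end; i++) p.keys.add("key-" + i);
            p.nextToken = (end < total) ? Integer.toString(end) : null;
            return p;
        }
    }

    public static void main(String[] args) {
        PagedStub client = new PagedStub(2500, 1000);
        List<String> allKeys = new ArrayList<>();
        String token = null;
        Page page;
        do {
            page = client.list(token);   // pass the previous page's token, if any
            allKeys.addAll(page.keys);
            token = page.nextToken;      // carry the token forward
        } while (page.isTruncated());
        System.out.println(allKeys.size());
    }
}
```

The key point is that each iteration carries the previous response's continuation token into the next request, and the loop only stops when the response reports it is no longer truncated.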

Reference: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/pagination.html
