
Amazon S3 returns only 1000 entries for one bucket and all for another bucket (using Java SDK)?

I am using the code below to get a list of all file names from an S3 bucket. I have two buckets in S3. For one of the buckets the code returns all the file names (more than 1000), but the same code returns only 1000 file names for the other bucket. I just don't get what is happening. Why does the same code work for one bucket and not for the other?

Also, my bucket has a hierarchical structure: folder/filename.jpg.

ObjectListing objects = s3.listObjects("bucket.new.test");
do {
    for (S3ObjectSummary objectSummary : objects.getObjectSummaries()) {
        String key = objectSummary.getKey();
        System.out.println(key);
    }
    objects = s3.listNextBatchOfObjects(objects);
} while (objects.isTruncated());

Improving on @Abhishek's answer. This code is slightly shorter and variable names are fixed.

You have to get the object listing, add its contents to the collection, then get the next batch of objects from the listing. Repeat until the listing is no longer truncated.

List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing objects = s3.listObjects("bucket.new.test");
keyList.addAll(objects.getObjectSummaries());

while (objects.isTruncated()) {
    objects = s3.listNextBatchOfObjects(objects);
    keyList.addAll(objects.getObjectSummaries());
}

For Scala developers, here is a recursive function to execute a full scan and map of the contents of an AmazonS3 bucket using the official AWS SDK for Java:

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{S3ObjectSummary, ObjectListing, GetObjectRequest}
import scala.collection.JavaConversions.{collectionAsScalaIterable => asScala}

def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T) = {

  def scan(acc:List[T], listing:ObjectListing): List[T] = {
    val summaries = asScala[S3ObjectSummary](listing.getObjectSummaries())
    val mapped = (for (summary <- summaries) yield f(summary)).toList

    if (!listing.isTruncated) mapped.toList
    else scan(acc ::: mapped, s3.listNextBatchOfObjects(listing))
  }

  scan(List(), s3.listObjects(bucket, prefix))
}

To invoke the above curried map() function, simply pass the already constructed (and properly initialized) AmazonS3Client object (refer to the official AWS SDK for Java API Reference), the bucket name, and the prefix in the first parameter list. Also pass the function f() you want to apply to each object summary in the second parameter list.

For example

val keyOwnerTuples = map(s3, bucket, prefix)(s => (s.getKey, s.getOwner))

will return the full list of (key, owner) tuples in that bucket/prefix,

or

map(s3, "bucket", "prefix")(s => println(s))

as you would normally approach it with monads in functional programming.

I have just changed the above code to use addAll instead of a for loop to add objects one by one, and it worked for me:

List<S3ObjectSummary> keyList = new ArrayList<S3ObjectSummary>();
ObjectListing object = s3.listObjects("bucket.new.test");
keyList.addAll(object.getObjectSummaries());
object = s3.listNextBatchOfObjects(object);

while (object.isTruncated()){
  keyList.addAll(object.getObjectSummaries());
  object = s3.listNextBatchOfObjects(object);
}
keyList.addAll(object.getObjectSummaries());

After that you can simply use any iterator over the list keyList.

If you want to get all objects (more than 1000 keys) you need to send another request to S3 with the last key as the marker. Here is the code.

private static String lastKey = "";
private static String preLastKey = "";
...

AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());
String bucketName = "bucketname";

do{
        preLastKey = lastKey;

        ListObjectsRequest lstRQ = new ListObjectsRequest().withBucketName(bucketName).withPrefix("");

        lstRQ.setMarker(lastKey);

        ObjectListing objectListing = s3.listObjects(lstRQ);

        //  loop and get files on S3
        for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
             //   get object and do something.....
             lastKey = objectSummary.getKey();   // remember the last key for the next request
        }
}while(!lastKey.equals(preLastKey));
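Since S3 returns keys in ascending order, the marker mechanism can be illustrated without S3 at all. The sketch below uses a hypothetical listAfter() method as a stand-in for a marker-based listObjects call; all names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for marker-based listing: S3 returns keys in ascending
// order, so requesting "keys strictly after the last key seen" pages through all of them.
public class MarkerDemo {

    /** Simulates one listObjects call with a marker: returns up to pageSize
     *  keys that sort strictly after the marker (allKeys must be sorted). */
    static List<String> listAfter(List<String> allKeys, String marker, int pageSize) {
        List<String> page = new ArrayList<>();
        for (String key : allKeys) {
            if (key.compareTo(marker) > 0) {
                page.add(key);
                if (page.size() == pageSize) break;
            }
        }
        return page;
    }

    /** Keeps requesting with the last returned key as the next marker
     *  until a request comes back empty. */
    static List<String> listAll(List<String> allKeys, int pageSize) {
        List<String> result = new ArrayList<>();
        String marker = "";
        while (true) {
            List<String> page = listAfter(allKeys, marker, pageSize);
            if (page.isEmpty()) break;
            result.addAll(page);
            marker = page.get(page.size() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 7; i++) keys.add(String.format("k%03d", i));
        System.out.println(listAll(keys, 3)); // all 7 keys, fetched in pages of 3, 3 and 1
    }
}
```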

In Scala:

val first = s3.listObjects("bucket.new.test")

val listings: Seq[ObjectListing] = Iterator.iterate(Option(first))(_.flatMap(listing =>
  if (listing.isTruncated) Some(s3.listNextBatchOfObjects(listing))
  else None
))
  .takeWhile(_.nonEmpty)
  .toList
  .flatten

An alternative way is to use a recursive method:

/**
 * A recursive method to wrap {@link AmazonS3} listObjectsV2 method.
 * <p>
 * By default, ListObjectsV2 can only return some or all (UP TO 1,000) of the objects in a bucket per request.
 * Ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
 * <p>
 * However, this method follows the continuation token across requests and therefore
 * returns all matching {@link S3ObjectSummary} entries, however many there are.
 *
 * @param request the initial {@link ListObjectsV2Request}
 * @return all object summaries across all pages of results
 */
private List<S3ObjectSummary> getS3ObjectSummaries(final ListObjectsV2Request request) {
    final ListObjectsV2Result result = s3Client.listObjectsV2(request);
    final List<S3ObjectSummary> resultSummaries = result.getObjectSummaries();
    if (result.isTruncated() && isNotBlank(result.getNextContinuationToken())) {
        final ListObjectsV2Request nextRequest = request.withContinuationToken(result.getNextContinuationToken());
        final List<S3ObjectSummary> nextResultSummaries = this.getS3ObjectSummaries(nextRequest);
        resultSummaries.addAll(nextResultSummaries);
    }
    return resultSummaries;
}
  1. Paolo Angioletti's code can't get all the data, only the last batch of data.
  2. I think it might be better to use ListBuffer.
  3. This method does not support setting startAfterKey.
    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary}    
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ListBuffer

    def map[T](s3: AmazonS3Client, bucket: String, prefix: String)(f: (S3ObjectSummary) => T): List[T] = {

      def scan(acc: ListBuffer[T], listing: ObjectListing): List[T] = {
        val r = acc ++= listing.getObjectSummaries.asScala.map(f).toList
        if (listing.isTruncated) scan(r, s3.listNextBatchOfObjects(listing))
        else r.toList
      }

      scan(ListBuffer.empty[T], s3.listObjects(bucket, prefix))
    }

The second method is to use AWS SDK v2:

<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>s3</artifactId>
    <version>2.1.0</version>
</dependency>
  import software.amazon.awssdk.services.s3.S3Client
  import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, S3Object}

  import scala.collection.JavaConverters._

  def listObjects[T](s3: S3Client, bucket: String,
                     prefix: String, startAfter: String)(f: (S3Object) => T): List[T] = {
    val request = ListObjectsV2Request.builder()
      .bucket(bucket).prefix(prefix)
      .startAfter(startAfter).build()

    s3.listObjectsV2Paginator(request)
      .asScala
      .flatMap(_.contents().asScala)
      .map(f)
      .toList
  }

By default the API returns up to 1,000 key names. The response might contain fewer keys but will never contain more. A better implementation would be to use the newer ListObjectsV2 API:

List<S3ObjectSummary> docList = new ArrayList<>();
ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName).withPrefix(folderFullPath);
ListObjectsV2Result listing;
do {
    listing = this.getAmazonS3Client().listObjectsV2(req);
    docList.addAll(listing.getObjectSummaries());
    String token = listing.getNextContinuationToken();
    req.setContinuationToken(token);
    LOG.info("Next Continuation Token for listing documents is :" + token);
} while (listing.isTruncated());

The code given by @oferei works well and I upvote that code. But I want to point out the root issue with @Abhishek's code. Actually, the problem is with your do-while loop.

If you observe carefully, you fetch the next batch of objects in the second-to-last statement and only then check whether you have exhausted the total list of files. So when you fetch the last batch, isTruncated() becomes false and you break out of the loop without processing the last X % 1000 records. For example: if in total you had 2123 records, you would end up fetching 1000 and then 1000, i.e. 2000 records. You miss the 123 records because the isTruncated value breaks the loop, as you process each batch only after checking the isTruncated value of the batch fetched after it.
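This off-by-one-batch behavior can be reproduced without S3 at all. The sketch below uses a hypothetical FakeListing class to simulate pages and contrasts the question's loop order with a corrected one; all names are invented for illustration:

```java
// Hypothetical FakeListing class: simulates S3's paged listing so the loop-order
// bug can be demonstrated without an AWS account.
public class PaginationDemo {

    static class FakeListing {
        final int offset, total, pageSize;
        FakeListing(int offset, int total, int pageSize) {
            this.offset = offset;
            this.total = total;
            this.pageSize = pageSize;
        }
        int batchSize()       { return Math.min(pageSize, total - offset); }
        boolean isTruncated() { return offset + batchSize() < total; }
        FakeListing next()    { return new FakeListing(offset + batchSize(), total, pageSize); }
    }

    /** The question's order: process, fetch the next page, THEN test isTruncated.
     *  The final, non-truncated page is fetched but never processed. */
    static int buggyCount(int total, int pageSize) {
        FakeListing listing = new FakeListing(0, total, pageSize);
        int processed = 0;
        do {
            processed += listing.batchSize(); // "process" the current page
            listing = listing.next();         // fetch the next page
        } while (listing.isTruncated());      // exits before the last page is processed
        return processed;
    }

    /** Corrected order: test isTruncated on the page just processed, then fetch. */
    static int fixedCount(int total, int pageSize) {
        FakeListing listing = new FakeListing(0, total, pageSize);
        int processed = listing.batchSize();
        while (listing.isTruncated()) {
            listing = listing.next();
            processed += listing.batchSize();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(buggyCount(2123, 1000)); // 2000: the last 123 records are lost
        System.out.println(fixedCount(2123, 1000)); // 2123: every record is seen
    }
}
```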

Apologies, I can't post a comment, else I would have commented on the upvoted answer.

The reason you are getting only the first 1000 objects is that this is how listObjects is designed to work.

This is from its JavaDoc:

Returns some or all (up to 1,000) of the objects in a bucket with each request. 
You can use the request parameters as selection criteria to return a subset of the objects in a bucket. 
A 200 OK response can contain valid or invalid XML. Make sure to design your application to parse the contents of the response and handle it appropriately. 
Objects are returned sorted in an ascending order of the respective key names in the list. For more information about listing objects, see Listing object keys programmatically 

To get paginated results automatically, use the listObjectsV2Paginator method:

ListObjectsV2Request listReq = ListObjectsV2Request.builder()
        .bucket(bucketName)
        .maxKeys(1)
        .build();

ListObjectsV2Iterable listRes = s3.listObjectsV2Paginator(listReq);

// Helper method to work with the paginated collection of items directly
listRes.contents().stream()
        .forEach(content -> System.out.println(" Key: " + content.key() + " size = " + content.size()));

You can opt for manual pagination as well if needed.

Reference: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/pagination.html
