
Python: getting the 100 most recent keys in an Amazon S3 bucket

I tried using boto, but its .list() method takes way too long for my data set, and its .get_all_keys() method returns keys in a seemingly random order. I want to get about 100-1000 of the most recent keys in my S3 bucket, which has millions of keys in it. What is the most efficient way of doing this?

import boto3

client = boto3.client('s3')

start_after = ''

# StartAfter is a parameter of list_objects_v2 (it does not exist on list_objects)
response = client.list_objects_v2(Bucket='<bucket>', StartAfter=start_after, MaxKeys=1000)

Save response['Contents'], where each entry has a LastModified key:

'Contents': [
    {
        'Key': 'string',
        'LastModified': datetime(2015, 1, 1),
        'ETag': 'string',
        'Size': 123,
        'StorageClass': 'STANDARD'|'REDUCED_REDUNDANCY'|'GLACIER'|'STANDARD_IA'|'ONEZONE_IA',
        'Owner': {
            'DisplayName': 'string',
            'ID': 'string'
        }
    },
],

Take the last key from these 1000 records, assign it to the start_after variable, and make another request.

The new request starts fetching keys that come after the key StartAfter specifies; a sketch of the full loop follows the documentation link below.

https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
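A minimal sketch of this pagination loop, assuming a placeholder bucket name 'my-bucket'. Note that S3 lists keys in lexicographic (name) order, not by date, so this collects LastModified for every key and sorts at the end, as the answer describes:

import boto3

client = boto3.client('s3')

def most_recent_keys(bucket, n=100):
    """Collect (LastModified, Key) for every object, then return the n newest."""
    entries = []
    start_after = ''
    while True:
        response = client.list_objects_v2(
            Bucket=bucket, StartAfter=start_after, MaxKeys=1000)
        contents = response.get('Contents', [])
        if not contents:
            break
        entries.extend((obj['LastModified'], obj['Key']) for obj in contents)
        # Use the last key of this page as the cursor for the next request.
        start_after = contents[-1]['Key']
        if not response.get('IsTruncated'):
            break
    entries.sort(reverse=True)  # newest first
    return [key for _, key in entries[:n]]

# 'my-bucket' is a placeholder bucket name.
print(most_recent_keys('my-bucket', n=100))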

If you don't mind the data being slightly out of date, you could use Amazon S3 Inventory, which can provide a daily CSV file listing all of the objects in the Amazon S3 bucket:

Amazon S3 inventory provides comma-separated values (CSV) or Apache optimized row columnar (ORC) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).

You could parse this file to obtain Keys and Last Modified dates, then sort by date.
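A minimal sketch of that parsing step. Inventory data files are gzip-compressed CSVs with no header row; the column order comes from the fileSchema in the inventory manifest, so the three-column "Bucket, Key, LastModifiedDate" layout assumed here, and the local file name 'inventory.csv.gz', are placeholders you would adapt to your configuration:

import csv
import gzip

rows = []
# 'inventory.csv.gz' is a placeholder path to one downloaded inventory data file.
with gzip.open('inventory.csv.gz', 'rt', newline='') as f:
    for bucket, key, last_modified in csv.reader(f):
        rows.append((last_modified, key))

# LastModifiedDate is an ISO 8601 timestamp, so a plain string sort is chronological.
rows.sort(reverse=True)
newest_keys = [key for _, key in rows[:100]]
print(newest_keys)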
