I tried using boto, but its .list()
method takes far too long for my data set, and its .get_all_keys()
method returns keys in no useful order. I want to get roughly the 100-1000 most recent keys in my S3 bucket, which holds millions of keys. What is the most efficient way to do this?
import boto3
client = boto3.client('s3')
start_after = ""
response = client.list_objects_v2(Bucket='<bucket>', StartAfter=start_after, MaxKeys=1000)
Save response['Contents'], whose entries each include a LastModified key:
'Contents': [
    {
        'Key': 'string',
        'LastModified': datetime(2015, 1, 1),
        'ETag': 'string',
        'Size': 123,
        'StorageClass': 'STANDARD'|'REDUCED_REDUNDANCY'|'GLACIER'|'STANDARD_IA'|'ONEZONE_IA',
        'Owner': {
            'DisplayName': 'string',
            'ID': 'string'
        }
    },
],
Take the last key from these 1000 records, assign it to the start_after variable, and make another request.
The new request fetches keys that come after the key StartAfter specifies.
https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
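Note that list_objects_v2 returns keys in key order, not by date, so finding the most recent objects means collecting the LastModified values and sorting them yourself. A minimal sketch of that idea, using boto3's built-in paginator instead of a manual StartAfter loop ('<bucket>' is a placeholder, and newest_keys is a hypothetical helper, not part of boto3):

```python
from datetime import datetime, timezone

def newest_keys(entries, n=100):
    """Return the n most recently modified entries from a list of
    list_objects_v2 'Contents' dicts (hypothetical helper)."""
    return sorted(entries, key=lambda e: e["LastModified"], reverse=True)[:n]

# Sketch of paging through the whole bucket (requires AWS credentials,
# so it is shown commented out):
# import boto3
# client = boto3.client("s3")
# entries = []
# paginator = client.get_paginator("list_objects_v2")
# for page in paginator.paginate(Bucket="<bucket>"):
#     entries.extend(page.get("Contents", []))
# recent = newest_keys(entries, 1000)

# Demonstration with fake 'Contents' entries:
fake = [
    {"Key": "a", "LastModified": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"Key": "b", "LastModified": datetime(2021, 1, 1, tzinfo=timezone.utc)},
    {"Key": "c", "LastModified": datetime(2019, 1, 1, tzinfo=timezone.utc)},
]
print([e["Key"] for e in newest_keys(fake, 2)])  # newest two keys first
```

This still lists every key once, which is why the S3 Inventory approach below can be cheaper for buckets with millions of objects.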
If you don't mind the data being slightly out of date, you could use Amazon S3 Inventory, which can provide a daily CSV file listing all of the objects in your Amazon S3 bucket:
Amazon S3 inventory provides comma-separated values (CSV) or Apache optimized row columnar (ORC) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
You could parse this file to obtain Keys and Last Modified dates, then sort by date.
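A rough sketch of that parsing step. The column layout depends on which optional fields you enable in the inventory configuration; this assumes Bucket, Key, Size, and LastModifiedDate, and the sample rows are made up. Since LastModifiedDate is an ISO 8601 timestamp, a plain string sort orders it correctly:

```python
import csv
import io

# Fake inventory content standing in for the downloaded CSV file;
# assumed columns: Bucket, Key, Size, LastModifiedDate.
sample = io.StringIO(
    '"my-bucket","logs/a.txt","10","2021-03-01T00:00:00.000Z"\n'
    '"my-bucket","logs/b.txt","20","2021-05-01T00:00:00.000Z"\n'
    '"my-bucket","logs/c.txt","30","2021-04-01T00:00:00.000Z"\n'
)
rows = list(csv.reader(sample))
# ISO 8601 timestamps sort chronologically as strings; newest first.
rows.sort(key=lambda r: r[3], reverse=True)
recent_keys = [r[1] for r in rows[:2]]
print(recent_keys)  # → ['logs/b.txt', 'logs/c.txt']
```

For a real inventory file you would open the (possibly gzip-compressed) CSV from the destination bucket instead of the in-memory sample.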