
How to download the latest n items from an AWS S3 bucket using boto3?

I have an S3 bucket where my application saves some final result DataFrames as .csv files. I would like to download the latest 1000 files in this bucket, but I don't know how to do it.

I cannot do it manually, because the bucket doesn't allow me to sort the files by date when it has more than 1000 elements.

[Image: bucket size limit for sorting]

I've seen some questions with solutions that use the AWS CLI, but I don't have sufficient user permissions to use the AWS CLI, so I have to do it with a boto3 Python script that I'm going to upload to a Lambda function.

How can I do this?

If your application uploads files periodically, you could try this:

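import boto3
import datetime

last_n_days = 250
s3 = boto3.client('s3')

paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='processed')
# LastModified is timezone-aware (UTC), so the cutoff must be timezone-aware too
date_limit = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=last_n_days)
for page in pages:
    # 'Contents' is missing from a page that has no matching keys
    for obj in page.get('Contents', []):
        # skip folder placeholder keys, which end with '/'
        if obj['LastModified'] >= date_limit and not obj['Key'].endswith('/'):
            s3.download_file('bucket', obj['Key'], obj['Key'].split('/')[-1])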

With the script above, all files modified in the last 250 days will be downloaded. If your application uploads about 4 files per day, that comes to roughly the 1000 files you want, so this could do the trick.
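If you need exactly the N newest files regardless of how often files are uploaded, one option is to scan the whole listing and keep only the N objects with the greatest LastModified. S3 returns keys in lexicographic order rather than by date, so the full prefix has to be listed. The following is a minimal sketch, not a definitive implementation: the helper name latest_n_objects is hypothetical, and the client is passed in as a parameter (for example boto3.client('s3')).

```python
import heapq


def latest_n_objects(s3_client, bucket, prefix, n):
    """Return the n most recently modified objects under prefix, newest first."""
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    objects = (
        obj
        for page in pages
        for obj in page.get('Contents', [])  # empty pages have no 'Contents'
        if not obj['Key'].endswith('/')      # skip folder placeholder keys
    )
    # nlargest keeps only n candidates in memory while scanning the listing
    return heapq.nlargest(n, objects, key=lambda obj: obj['LastModified'])


# Possible usage (bucket and prefix names are placeholders):
# s3 = boto3.client('s3')
# for obj in latest_n_objects(s3, 'bucket', 'processed', 1000):
#     s3.download_file('bucket', obj['Key'], '/tmp/' + obj['Key'].split('/')[-1])
```

The usage comment writes to /tmp because that is the writable directory inside a Lambda function. Listing a very large bucket this way costs one request per 1000 keys, so it may be slow compared with a date-based filter.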

The best solution is to redefine your problem: rather than retrieving the N most recent files, retrieve all files from the N most recent days. I think that you'll find this to be a better solution in most cases.

However, to make it work you'll need to adopt some form of date-stamped prefix for the uploaded files. For example, 2021-04-16/myfile.csv.
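As a rough illustration of the date-stamped prefix idea, the sketch below builds keys in the 2021-04-16/myfile.csv form and generates the prefixes for the N most recent days. Both helper names (dated_key, recent_day_prefixes) are hypothetical, and using UTC dates is an assumption; the uploading application would need to follow the same convention.

```python
import datetime


def dated_key(filename, when=None):
    """Build a key such as '2021-04-16/myfile.csv' from the upload date."""
    when = when or datetime.datetime.now(datetime.timezone.utc)
    return f"{when:%Y-%m-%d}/{filename}"


def recent_day_prefixes(n_days, today=None):
    """Return the prefixes for the n_days most recent days, newest first."""
    today = today or datetime.datetime.now(datetime.timezone.utc).date()
    prefixes = []
    for offset in range(n_days):
        day = today - datetime.timedelta(days=offset)
        prefixes.append(f"{day:%Y-%m-%d}/")
    return prefixes
```

Each returned prefix can then be passed as the Prefix argument of a list_objects_v2 call, so only the objects for those days are listed. ISO-formatted dates also sort lexicographically in chronological order, which matches the order in which S3 lists keys.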

If you feel that you must retrieve N files, then you can use the prefix to retrieve only a portion of the list. Assuming you know that approximately 100 files are uploaded per day, start your bucket listing at 2021-04-05/.
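One way to start the listing at a given date prefix is the StartAfter parameter of list_objects_v2, which returns only keys that sort after the given string. The sketch below assumes date-stamped keys as described above; the helper name latest_n_since is hypothetical and the client is passed in (for example boto3.client('s3')).

```python
from collections import deque


def latest_n_since(s3_client, bucket, start_after, n):
    """List keys that sort after start_after and return the last n of them."""
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, StartAfter=start_after)
    # Keys arrive in ascending lexicographic order, so with date-stamped
    # prefixes the last n keys seen belong to the most recent days.
    newest = deque(maxlen=n)
    for page in pages:
        for obj in page.get('Contents', []):
            if not obj['Key'].endswith('/'):
                newest.append(obj['Key'])
    return list(newest)
```

Within a single day the keys are ordered by file name rather than upload time, so the boundary day may not contain strictly the newest files. If that matters, sorting that day's objects by LastModified would be needed.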

