
Copy limited number of files from S3?

We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and loading them into memory to do some operations.

That copy operation is done via an AWS CLI command that looks something like this:

aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile

The problem is that the number of JSON files on S3 is getting pretty large, since more are being made every day. It's nowhere near the capacity of the S3 bucket, since the files are so small. However, in practical terms, there's no need to copy ALL of these JSON files. Realistically the system would be safe just copying the most recent 100 or so. But we do want to keep the older ones around for other purposes.

So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?

  1. You can set lifecycle policies on the S3 bucket, which will remove objects after a certain period of time.
  2. To copy only objects from the last few days, you will need to write a script (see the sketches below).
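As a rough sketch of point 1, a lifecycle rule that expires objects after a given number of days can be attached from the CLI. The prefix and the 90-day window below are placeholders, not values from the question:

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-build-json",
      "Filter": { "Prefix": "builds/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket bucket-path \
    --lifecycle-configuration file://lifecycle.json \
    --profile dev-profile

For point 2 (and the "most recent 100" part of the question), one option that avoids a full script is to let the CLI sort the listing by LastModified and take the last 100 keys. This is only a sketch, and the bucket name and prefix are again placeholders; note that list-objects-v2 still enumerates every key under the prefix, since the sorting and slicing happen client-side:

aws s3api list-objects-v2 \
    --bucket bucket-path \
    --prefix builds/ \
    --query 'sort_by(Contents, &LastModified)[-100:].Key' \
    --output text \
    --profile dev-profile |
tr '\t' '\n' |
while read -r key; do
    # skip blank lines, then pull each of the newest keys down individually
    [ -n "$key" ] && aws s3 cp "s3://bucket-path/${key}" ~/some/local/path/ --profile dev-profile
done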

The aws s3 sync command in the AWS CLI sounds perfect for your needs.

It will copy only files that are new or modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
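For example, swapping cp for sync in the command from the question would look something like this (sync is recursive by default, so --recursive is not needed):

aws s3 sync s3://bucket-path ~/some/local/path/ --profile dev-profile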

Alternatively, you could write a script (e.g. in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
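A minimal sketch of that idea, done here with the CLI rather than Python: record the time of the previous run in a local state file, and copy only keys whose LastModified is newer. The state-file name, bucket, and prefix are assumptions, and the date filter relies on the CLI's client-side JMESPath comparison of ISO-8601 timestamp strings:

#!/usr/bin/env bash
# Sketch: copy only objects added since the previous run of this script.
BUCKET="bucket-path"                 # placeholder bucket name
PREFIX="builds/"                     # placeholder key prefix
DEST=~/some/local/path/
STATE_FILE="$HOME/.last_s3_copy"     # remembers when the previous run happened

LAST_RUN=$(cat "$STATE_FILE" 2>/dev/null || echo "1970-01-01T00:00:00")
NOW=$(date -u +%Y-%m-%dT%H:%M:%S)

# List keys modified after the previous run (ISO-8601 strings compare lexicographically).
aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "$PREFIX" \
    --query "Contents[?LastModified>\`${LAST_RUN}\`].Key" \
    --output text \
    --profile dev-profile |
tr '\t' '\n' |
while read -r key; do
    [ -z "$key" ] && continue
    [ "$key" = "None" ] && continue   # an empty listing renders as "None" in text output
    aws s3 cp "s3://${BUCKET}/${key}" "$DEST" --profile dev-profile
done

echo "$NOW" > "$STATE_FILE"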
