
Copy limited number of files from S3?

We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and loading them into memory to do some operations.

That copy operation is done via an AWS CLI command that looks something like this:

aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile

The problem is that the number of JSON files on S3 is getting pretty large, since more are being made every day. It's nowhere near the capacity of the S3 bucket, since the files are so small. However, in practical terms, there's no need to copy ALL of these JSON files. Realistically the system would be safe just copying the most recent 100 or so. But we do want to keep the older ones around for other purposes.

So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?

  1. You can set lifecycle policies on the S3 bucket, which will remove objects after a certain period of time.
  2. To copy only objects from the last few days, you will need to write a script (see the sketches below).
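As a rough sketch of point 1, a lifecycle rule that expires objects after a given number of days can be attached from the CLI. The prefix and the 90-day window below are placeholders, not values from the question:

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-build-json",
      "Filter": { "Prefix": "builds/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket bucket-path \
    --lifecycle-configuration file://lifecycle.json \
    --profile dev-profile

For point 2 (and the "most recent 100" part of the question), one option that avoids a full script is to let the CLI sort the listing by LastModified and take the last 100 keys. This is only a sketch, and the bucket name and prefix are again placeholders; note that list-objects-v2 still enumerates every key under the prefix, since the sorting and slicing happen client-side:

aws s3api list-objects-v2 \
    --bucket bucket-path \
    --prefix builds/ \
    --query 'sort_by(Contents, &LastModified)[-100:].Key' \
    --output text \
    --profile dev-profile |
tr '\t' '\n' |
while read -r key; do
    # skip blank lines, then pull each of the newest keys down individually
    [ -n "$key" ] && aws s3 cp "s3://bucket-path/${key}" ~/some/local/path/ --profile dev-profile
done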

The aws s3 sync command in the AWS CLI sounds perfect for your needs.

It will copy only files that are new or modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
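For example, swapping cp for sync in the command from the question would look something like this (sync is recursive by default, so --recursive is not needed):

aws s3 sync s3://bucket-path ~/some/local/path/ --profile dev-profile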

Alternatively, you could write a script (e.g. in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
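A minimal sketch of that idea, done here with the CLI rather than Python: record the time of the previous run in a local state file, and copy only keys whose LastModified is newer. The state-file name, bucket, and prefix are assumptions, and the date filter relies on the CLI's client-side JMESPath comparison of ISO-8601 timestamp strings:

#!/usr/bin/env bash
# Sketch: copy only objects added since the previous run of this script.
BUCKET="bucket-path"                 # placeholder bucket name
PREFIX="builds/"                     # placeholder key prefix
DEST=~/some/local/path/
STATE_FILE="$HOME/.last_s3_copy"     # remembers when the previous run happened

LAST_RUN=$(cat "$STATE_FILE" 2>/dev/null || echo "1970-01-01T00:00:00")
NOW=$(date -u +%Y-%m-%dT%H:%M:%S)

# List keys modified after the previous run (ISO-8601 strings compare lexicographically).
aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "$PREFIX" \
    --query "Contents[?LastModified>\`${LAST_RUN}\`].Key" \
    --output text \
    --profile dev-profile |
tr '\t' '\n' |
while read -r key; do
    [ -z "$key" ] && continue
    [ "$key" = "None" ] && continue   # an empty listing renders as "None" in text output
    aws s3 cp "s3://${BUCKET}/${key}" "$DEST" --profile dev-profile
done

echo "$NOW" > "$STATE_FILE"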
