
Download millions of files from an S3 bucket

I have millions of files in different folders in an S3 bucket.

The files are very small. I wish to download all the files under the folder named VER1. The folder VER1 contains many subfolders, and I wish to download all of the millions of files under all the subfolders of VER1.

(e.g. VER1 -> sub1 -> file1.txt, VER1 -> sub1 -> subsub1 -> file2.text, etc.)

What is the fastest way to download all the files?

Using s3 cp? s3 sync?

Is there a way to download all the files located under the folder in parallel?

Use the AWS Command-Line Interface (CLI):

aws s3 sync s3://bucket/VER1 [name-of-local-directory]

From my experience, it will download in parallel, but it won't necessarily use the full bandwidth because there is a lot of per-object overhead. (It is more efficient for large objects, since there is proportionally less overhead.)

aws s3 sync might have problems with a very large number of files; you'd have to try it to see whether it works.
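
Before going further, it may be worth raising the CLI's own parallelism. These are documented AWS CLI S3 configuration settings (the defaults are 10 concurrent requests and a task queue of 1000); the values here are illustrative, not tuned recommendations:

aws configure set default.s3.max_concurrent_requests 50
aws configure set default.s3.max_queue_size 10000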

If you really wanted full performance, you could write your own code that downloads with massive parallelism, but the time saved would probably be lost in the time it takes to write and test such a program.
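
As a rough sketch of one do-it-yourself approach (my assumption of how it could look, not a tested program): list every key under VER1, then fan out parallel aws s3 cp workers with xargs. It assumes keys contain no spaces, and each cp process still pays the CLI's startup cost:

# list every key under VER1 (4th column of the ls output), then run
# up to 32 parallel "aws s3 cp" workers; {} is the key, reused as the local path
aws s3 ls s3://bucket/VER1/ --recursive | awk '{print $4}' \
    | xargs -P 32 -I {} aws s3 cp "s3://bucket/{}" "{}"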

Another option is to use aws s3 sync to download to an Amazon EC2 instance, then zip the files and simply download the zip file. That would reduce bandwidth requirements.
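
A minimal sketch of that flow, assuming the instance runs in the same region as the bucket (so the S3-to-EC2 copy is fast and incurs no data-transfer charge):

# on the EC2 instance
aws s3 sync s3://bucket/VER1 VER1
zip -r VER1.zip VER1
# then pull VER1.zip down once (e.g. with scp) and unzip it locally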
