How to download, decompress and transfer multiple files directly into an S3 bucket?

My problem is the following: I would like to download a dataset hosted somewhere using its URL, decompress it, and upload the files (e.g. images) to an S3 bucket. An example dataset could be CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html, where the dataset URL would be https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz

Note that in some cases the dataset is huge, so downloading it to my local computer first is simply not an option. I thought about building a pipe to keep this as streamlined as possible. The command below works for a single file (e.g. a single image):

curl "url/single_image.tar.gz" | tar xvz | aws s3 cp - s3://my_bucket/single_image.jpg

But if the compressed archive contains multiple files (e.g. multiple images), the command above no longer works, since it requires specifying the destination filename and extension.

What is the simplest solution to this problem?

Use GNU tar with the --to-command option, which allows you to:

Extract files and pipe their contents to the standard input of command. When this option is used, instead of creating the files specified, tar invokes command once for each regular file extracted and pipes that file's contents to the command's standard input.

It even supports the following:

The command can obtain information about the file it processes from the following environment variables:

TAR_FILENAME: the name of the file.
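
Before wiring in S3, you can see the variable in action with a minimal sketch like the one below (the archive name archive.tar.gz is just a placeholder):

tar -xzf archive.tar.gz --to-command='echo "got $TAR_FILENAME ($(wc -c) bytes)"'

Each member's contents arrive on the command's standard input, so wc -c reports its size. The single quotes matter: they keep your interactive shell from expanding $TAR_FILENAME, so it is expanded instead by the shell that tar spawns once per extracted file.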

The following command should do what you want:

curl https://xxxxx/test.tar.gz | tar -xz --to-command='aws s3 cp - s3://bucket/$TAR_FILENAME'
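
Applied to the CIFAR-100 archive from the question, that looks roughly like this (the bucket name my_bucket comes from the question, the cifar-100/ key prefix is an arbitrary choice, and -L is only there so curl follows any redirect):

curl -L https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz | tar -xz --to-command='aws s3 cp - s3://my_bucket/cifar-100/$TAR_FILENAME'

tar runs aws s3 cp once per extracted file, streaming that file's bytes from standard input straight into its own S3 object, so nothing is ever written to the local disk. Again, keep the single quotes so that $TAR_FILENAME is expanded by the shell tar spawns, not by your own shell.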
