
shell script - AWS s3 bucket, find last added file

I'm trying to get the last added file in an AWS S3 bucket using a Linux shell script. Can anyone let me know how to do this?

One way is to use the output of s3cmd and sort the output to get the last added file.

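s3cmd ls s3://{{bucket}} | sort | tail -n 1 | awk '{print $4}'
  • sort - sorts the output by timestamp (each line starts with the date and time)
  • tail -n 1 - returns the last line, i.e. the most recent file
  • awk '{print $4}' - prints the file name (the fourth column, after date, time and size)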
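
The pipeline above can be wrapped in a small function that reads the listing from stdin. This is only a sketch: it assumes the default four-column s3cmd ls layout (date, time, size, URI), and latest_s3cmd_uri is a name made up for this example. It skips "DIR" rows and keeps URIs that contain spaces intact, which a plain awk column print would truncate.

```shell
#!/bin/sh
# latest_s3cmd_uri (hypothetical helper): read `s3cmd ls` output on stdin
# and print the URI of the most recently modified object.
# Assumption: each object row is "<date> <time> <size> <URI>".
latest_s3cmd_uri() {
  # Keep only object rows (they start with a YYYY-MM-DD date); DIR rows are dropped.
  grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} ' |
    # ISO-style dates and times sort chronologically as plain text.
    LC_ALL=C sort -k1,1 -k2,2 |
    tail -n 1 |
    # Strip the date, time and size columns; whatever remains is the URI.
    sed -E 's/^[^ ]+ +[^ ]+ +[^ ]+ +//'
}
```

Usage would look like: s3cmd ls s3://{{bucket}} | latest_s3cmd_uri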

Recommendation, tl;dr

Based on the simple performance test below, the best compromise between simplicity and speed at the time of writing is aws s3 ls --recursive (Option #2).


3 ways to get the last modified object

1. Using s3cmd

(See s3cmd Usage, or explore the man page after installing it with sudo pip install s3cmd.)

s3cmd ls s3://the-bucket | sort | tail -n 1

2. Using AWS CLI's s3

aws s3 ls the-bucket --recursive --output text | sort | tail -n 1 | awk '{print $1"T"$2","$3","$4}'

(Note that awk in the above refers to GNU awk; on macOS it, like the other GNU utilities, may need to be installed separately.)
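
As a variation on Option #2, the sort step can be avoided by tracking the newest row in a single awk pass. This is a sketch, not a benchmarked replacement: it assumes the usual aws s3 ls --recursive row layout (date, time, size, key), latest_aws_ls_csv is a made-up name, and the key is not CSV-quoted, so keys containing commas would need extra handling. It uses only POSIX awk features and keeps keys with spaces whole.

```shell
#!/bin/sh
# latest_aws_ls_csv (hypothetical helper): read `aws s3 ls --recursive`
# output on stdin and print "timestamp,size,key" for the newest object.
latest_aws_ls_csv() {
  awk '
    # Only consider rows that start with a YYYY-MM-DD date.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      stamp = $1 "T" $2
      # String comparison is chronological for this fixed-width format.
      if (stamp >= best) {
        best = stamp
        size = $3
        key = $0
        # Remove date, time and size; the remainder is the key, spaces included.
        sub(/^[^ ]+ +[^ ]+ +[^ ]+ /, "", key)
      }
    }
    END { if (best != "") printf "%s,%s,%s\n", best, size, key }
  '
}
```

Usage would look like: aws s3 ls the-bucket --recursive | latest_aws_ls_csv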


3. Using AWS CLI's s3api

(with either list-objects or list-objects-v2)

aws s3api list-objects-v2 --bucket the-bucket | jq  -r '.Contents | max_by(.LastModified) | [.Key, .LastModified, .Size]|@csv'

Note that both of the s3api commands are paginated, and easier handling of the pagination is a fundamental improvement in v2 of list-objects.

If the bucket has more than 1,000 objects (use s3cmd du "s3://ons-dap-s-logs" | awk '{print $2}' to get the number of objects), a single API call does not return all of them. You then need to handle the pagination of the API and collect the results of every call before picking the latest object, since the returned results are sorted in UTF-8 binary order of the key and not by 'Last Modified'.
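
One way to sidestep the paging concern from the shell is sketched below. It relies on two documented AWS CLI behaviours: by default the CLI follows continuation tokens and fetches every page, and with --output text the --query expression is evaluated once per page. Emitting one row per object and sorting afterwards is therefore safe across pages, whereas picking a maximum inside --query would yield one candidate per page. The function names list_objects_tsv and latest_object are invented for this example.

```shell
#!/bin/sh
# list_objects_tsv BUCKET (hypothetical helper):
# print one "LastModified<TAB>Key" row per object, across all pages.
list_objects_tsv() {
  aws s3api list-objects-v2 --bucket "$1" \
    --query 'Contents[].[LastModified,Key]' --output text
}

# latest_object BUCKET (hypothetical helper):
# print the row with the greatest timestamp. The timestamps are ISO 8601
# strings of equal length, so a byte-wise sort orders them chronologically.
latest_object() {
  list_objects_tsv "$1" | LC_ALL=C sort | tail -n 1
}
```

Usage would look like: latest_object the-bucket. For an empty bucket the CLI has no Contents to print, so the result is likely the literal text None rather than an empty line.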


Performance comparison

Here is a simple performance comparison of the above three methods, executed against the same bucket. For simplicity, the bucket had fewer than 1,000 objects. Here is the one-liner (written for zsh) to see the execution times:

export bucket_name="the-bucket" && \
( \
time ( s3cmd     ls --recursive           "s3://${bucket_name}"             | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1                             ) & \
time ( aws s3    ls --recursive           "${bucket_name}"    --output text | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1                             ) & \
time ( aws s3api list-objects-v2 --bucket "${bucket_name}"                  | jq  -r '.Contents | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) & \
time ( aws s3api list-objects    --bucket "${bucket_name}"                  | jq  -r '.Contents | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) &
) >! output.log

(output.log will store the last modified object listed by each command.)

The output of the above is as follows:

( s3cmd ls --recursive ...)      1.10s user 0.10s system 79% cpu 1.512 total
( aws s3 ls --recursive ...)     0.72s user 0.12s system 74% cpu 1.128 total
( aws s3api list-objects-v2 ...) 0.54s user 0.11s system 74% cpu 0.867 total
( aws s3api list-objects ...)    0.57s user 0.11s system 75% cpu 0.900 total

For the same number of objects being returned, aws s3api calls are appreciably more performant; however, there is the additional (scripting) complexity for dealing with the pagination of the API.

Useful link(s): See Leveraging s3 and s3api to understand the difference between aws s3 and aws s3api

That's not possible. S3 is not a database or filesystem.

However, with S3 queries you can request a list of objects that were created or modified after a certain date:

aws s3api list-objects --bucket "YOURBUCKET" --query 'Contents[?LastModified>=`2016-12-27`][].{Key: Key}'
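
If the cutoff date comes from a script variable, the quoting gets awkward because the --query filter needs the date wrapped in backticks. A minimal sketch of a helper that builds the filter string is shown here; since_query is a hypothetical name, and the date check only validates the shape YYYY-MM-DD, not whether the date exists.

```shell
#!/bin/sh
# since_query YYYY-MM-DD (hypothetical helper): print a --query filter that
# selects the keys of objects modified on or after the given date.
since_query() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "usage: since_query YYYY-MM-DD" >&2; return 1 ;;
  esac
  # Backticks are literal inside single quotes, so the shell does not try
  # to run them; the filter receives the date as a backtick-delimited literal.
  printf 'Contents[?LastModified>=`%s`][].{Key: Key}\n' "$1"
}

# Example call (requires a configured AWS CLI):
#   aws s3api list-objects --bucket "YOURBUCKET" --query "$(since_query 2016-12-27)"
```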

And if you want only added objects, not modified ones, you'll have to create a custom metadata attribute, save it with the object, and query based on that custom attribute.
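
A rough sketch of that idea follows. The metadata name added-at and both function names are assumptions made for this example. Note that listing calls do not return user-defined metadata, so the stored value has to be read back per object with head-object, which can be slow for large buckets.

```shell
#!/bin/sh
# upload_with_added_at FILE BUCKET KEY (hypothetical helper):
# upload FILE and record the upload time in user-defined metadata.
# "added-at" is an illustrative name, not an S3 convention.
upload_with_added_at() {
  added_at=$(date -u +%Y%m%dT%H%M%SZ)
  aws s3 cp "$1" "s3://$2/$3" --metadata "added-at=$added_at"
}

# read_added_at BUCKET KEY (hypothetical helper):
# print the stored "added-at" value for a single object.
read_added_at() {
  aws s3api head-object --bucket "$1" --key "$2" \
    --query 'Metadata."added-at"' --output text
}
```

Usage would look like: upload_with_added_at ./report.csv your-bucket reports/report.csv, followed later by read_added_at your-bucket reports/report.csv.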

aws s3 ls s3://your-bucket --recursive | sort | tail -n 1

This command will recursively check all files in all folders and subfolders of an S3 bucket, and return the name of the file most recently modified as well as the timestamp of that modification.

(Note, awscli should be installed first and configured with your AWS account info. See https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-configure-cli.html .)
