简体   繁体   中英

How to delete files that matches a specific pattern in S3 bucket?

I have an S3 bucket that I'm saving CSV files for loading them into Redshift. I'm using Python and Boto3 for this. After loading them into Redshift I want to delete specific files that match a pattern that contains the processing ID for my code.

I'm saving my files into S3 bucket as follows

Redshift{processingID}-table1.csv
Redshift{processingID}-table2.csv
Redshift{processingID}-table3.csv
Redshift{processingID}-table4.csv

After processing those files that contains specific ID, I want to delete the processed files from my S3 bucket . How do I specify the pattern.

This is the pattern that I'm trying to delete the files from bucket.

Redshift11-*.csv . Here 11 is the processingID . How do I delete all the files that matches the pattern using boto3 ?

I've come across this. https://stackoverflow.com/a/53836093/4626254

But it seems like it is searching for the folder as prefix and not the exact pattern of the file.

You can do prefix filtering server-side but you'll have to do suffix-filtering client-side. For example:

import boto3
s3 = boto3.resource('s3')

bucket = s3.Bucket('mybucket')
files = [os.key for os in bucket.objects.filter(Prefix="myfolder/Redshift11-")]
csv_files = [file for file in files if file.endswith('.csv')]

print(f'All files: {files}')
print(f'CSV files: {csv_files}')

There's no way to tell S3 to delete files that meet a specific pattern - you just have to delete one file at a time. You can list keys with a specific prefix

ex: Redshift or application_name_used_as_prefix ) by modifying your file naming to have a unique prefix.

Or if you need to rely on regex, then you must specify start and end rules like:

import re

pattern = r"Redshift([0-9]+)-(\w+).csv$"
re.match(pattern, 'Redshift2-table1.csv')

Hope this helps!

If you are using AWS CLI you can filter LS results with grep regex and delete them. For example

aws s3 ls s3://BUCKET | awk '{print $4}' | grep -E -i '^2018-([0-9][0-9])\-([0-9][0-9])\-([0-9][0-9])\-([0-9][0-9])\-([0-9][0-9])\-([0-9a-zA-Z]*)' | xargs -I% bash -c 'aws s3 rm s3://BUCKET/%'

This is slow but it works

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM