
Search for a specific file in an AWS S3 bucket using Python

I have AWS S3 access, and the bucket contains nearly 300 files. I need to download a single file from this bucket by pattern matching or search, because I do not know the exact filename (say, files ending in .csv).
Here is my sample code, which lists all files inside the bucket:

import os
import xml.etree.ElementTree as ET

from boto.exception import S3ResponseError
from boto.s3.connection import S3Connection


def s3connection(credentialsdict):
    """
    :param credentialsdict: dict holding the AWS credentials and bucket name,
        under the keys "access_key", "secret_key" and "bucket_name"
    :return: status, billing_bucket, billing_key
    """
    os.environ['S3_USE_SIGV4'] = 'True'
    conn = S3Connection(credentialsdict["access_key"], credentialsdict["secret_key"], host='s3.amazonaws.com')
    billing_bucket = conn.get_bucket(credentialsdict["bucket_name"], validate=False)
    try:
        billing_bucket.get_location()
    except S3ResponseError as e:
        if e.status == 400 and e.error_code == 'AuthorizationHeaderMalformed':
            # The error body names the region the requests must be signed for
            conn.auth_region_name = ET.fromstring(e.body).find('./Region').text
    billing_bucket = conn.get_bucket(credentialsdict["bucket_name"])
    print(billing_bucket)

    if not billing_bucket:
        raise Exception("Please enter a valid bucket name. Bucket %s does not exist"
                        % credentialsdict.get("bucket_name"))
    for key in billing_bucket.list():
        print(key.name)
    del os.environ['S3_USE_SIGV4']

Can I pass a search string to retrieve only the matching filenames?

There is no way to do this, because S3 has no native support for regex. You have to retrieve the entire list and apply the search/regex on the client side. The only server-side filtering option available in list_objects is by prefix.

list_objects

Prefix (string) -- Limits the response to keys that begin with the specified prefix.
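For illustration, here is a minimal boto3 sketch of that prefix filtering; the bucket name and prefix are placeholders, not values from the question:

import boto3

# List only the keys that begin with the given prefix.
# "your_bucket_name" and "reports/" are placeholders.
client = boto3.client('s3')
response = client.list_objects_v2(Bucket="your_bucket_name", Prefix="reports/")
for obj in response.get('Contents', []):
    print(obj['Key'])

Note that a single list_objects_v2 call returns at most 1000 keys, which is enough for the ~300 files mentioned in the question; larger buckets need a paginator, as shown further below.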

One option is to use the Python module re and apply it to the list of objects.

import re

pattern = re.compile(<file_pattern_you_are_looking_for>)
for key in billing_bucket.list():
    if pattern.match(key.name):
        print(key.name)

You can also use a simple if condition, like:

import boto3

s3 = boto3.resource('s3')
buck = s3.Bucket("your_bucket_name")  # placeholder bucket name

prefix_objs = buck.objects.filter(Prefix="your_bucket_path")

for obj in prefix_objs:
    key = obj.key
    if key.endswith(".csv"):
        body = obj.get()['Body'].read()
        print(obj.key)
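If the goal is to save the matched file locally rather than read its body into memory, a minimal sketch using Bucket.download_file could look like this; the bucket name and prefix are placeholders:

import os

import boto3

# Download each matched .csv to the current directory.
# "your_bucket_name" and "your_bucket_path" are placeholders.
s3 = boto3.resource('s3')
buck = s3.Bucket("your_bucket_name")
for obj in buck.objects.filter(Prefix="your_bucket_path"):
    if obj.key.endswith(".csv"):
        # download_file streams the object straight to a local path
        buck.download_file(obj.key, os.path.basename(obj.key))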

You can use JMESPath expressions to search and filter S3 files. To do that, you need to get an S3 paginator over list_objects_v2.

import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket="your_bucket_name")

Now that you have an iterator, you can use JMESPath search. The most useful function is contains, for doing a %like% query:

objects = page_iterator.search("Contents[?contains(Key, `partial-file-name`)][]")

But in your case (finding all files ending in .csv), it's better to use ends_with, which does a *.csv-style query:

objects = page_iterator.search("Contents[?ends_with(Key, `.csv`)][]")

Then you can get the object keys with:

for item in objects:
    print(item['Key'])
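Putting the pieces together, a sketch that finds the first .csv in the bucket and downloads it might look like this; the bucket name and local filename are assumptions, not part of the original answer:

import boto3

# Find the first .csv key via the paginator + JMESPath, then download it.
# "your_bucket_name" and "billing.csv" are placeholders.
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket="your_bucket_name")
matches = page_iterator.search("Contents[?ends_with(Key, `.csv`)][]")
first = next(matches, None)
if first is not None:
    client.download_file("your_bucket_name", first['Key'], "billing.csv")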

This answer is based on https://blog.jeffbryner.com/2020/04/21/jupyter-pandas-analysis.html and https://stackoverflow.com/a/27274997/4587704
