
How to retrieve subfolders and files from a folder in an S3 bucket using boto3?

As I am quite new to loading data from an AWS S3 bucket, I am facing some difficulties querying data from subfolders. Here are the steps and the bucket description:

Countries S3 bucket

  • a subfolder for every extraction time, e.g. (2021-08-12, 2021-08-11, ...)
    • each subfolder contains:
      • 2 sub-subfolders, each containing JSON files that I need to query
      • other JSON files that I also need to query (example keys below)
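
For illustration, the object keys might look like this (the sub-subfolder and file names here are hypothetical):

countries/2021-08-12/region-a/cities.json
countries/2021-08-12/region-b/cities.json
countries/2021-08-12/summary.json
countries/2021-08-11/region-a/cities.json
...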

Code produced so far:

import boto3

s3 = boto3.resource("s3")
# Get bucket
bucket_name = "countries"
bucket = s3.Bucket(name=bucket_name)
path = "countries/"

1- This step fetches all the outer subfolders named by extraction time:

folders = []

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket_name, Prefix=path, Delimiter='/')
for o in result.get('CommonPrefixes'):
    folders.append(o.get('Prefix'))
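
Note that list_objects returns at most 1,000 keys per response, and the result has no CommonPrefixes entry when nothing matches the delimiter, so result.get('CommonPrefixes') can be None. A guarded version of the loop (same logic, just defensive; the accepted answer below switches to a paginator for the same reason):

for o in result.get('CommonPrefixes') or []:
    folders.append(o.get('Prefix'))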

2- Next, iterate over every subfolder and extract all the content inside:

for i in folders:
    sub = client.list_objects(Bucket=bucket_name, Prefix=folders[i], Delimiter='/')

3- Next, extract the JSON files and subfolders, and append or join everything together.

Currently I am failing at the 2nd step: passing an explicit index such as Prefix=folders[0] returns the content of one subfolder from step 1, but when I try to iterate with Prefix=folders[i] I get this error: TypeError: list indices must be integers or slices, not str.
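
The cause is that in for i in folders, the loop variable i is already a prefix string, not an index, so it should be passed directly. A minimal sketch of the corrected loop:

for folder in folders:
    sub = client.list_objects(Bucket=bucket_name, Prefix=folder, Delimiter='/')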

I was able to solve the full problem as follows:

# Setup copied from the question, plus the import and client the code below relies on
import boto3

s3 = boto3.resource("s3")
bucket_name = "countries"
bucket = s3.Bucket(name=bucket_name)
path = "countries/"

client = boto3.client('s3')

# Create a reusable Paginator
paginator = client.get_paginator('list_objects_v2')

# create list to store .json files
json_files = []

# Search through each folder that was returned for the path
def path_search(path_folders):
    for folder in path_folders:
        folder_search(folder)


def folder_search(folder):
    # get subfolders in folder
    list_objects_in_folder = client.list_objects(Bucket=bucket_name, Prefix=folder, Delimiter='/')
    
    #---- Looking for .json files -----
    
    # Create a PageIterator from the Paginator
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=folder, Delimiter='/')
    
    # Filter results for json files only. Can use JMESPath for other searches
    filtered_iterator = page_iterator.search("Contents[?ends_with(Key, '.json')][]")
    
    for key_data in filtered_iterator:
        if key_data is not None:
            json_files.append(key_data['Key'])
    
    #-----------------------------------------------------------
    # get filepath for each item in the folder:
    # get the prefix of each subfolder in the folder; CommonPrefixes is
    # absent when there are no subfolders, hence the `or []` guard
    for item in list_objects_in_folder.get("CommonPrefixes") or []:
        filepath = item.get("Prefix")
        # start the loop over again for the subfolder
        folder_search(filepath)

This code will iterate through every folder under the given path, look for .json files and subfolders, enter each subfolder, and repeat.
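
For completeness, a short usage sketch tying the pieces together; flattening the key into a local file name is just one choice, and client.download_file is the call documented at the second link below:

# Collect every .json key under the folders found in step 1
path_search(folders)

# Download each key, flattening it into a local file name
for key in json_files:
    client.download_file(bucket_name, key, key.replace('/', '_'))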

You can also find more help at these locations:

https://jmespath.org/examples.html

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_file
