I am quite new to loading data from an AWS S3 bucket, and I am having difficulty querying data from subfolders. Here are the steps and the bucket layout:

Countries S3 bucket
- a subfolder for every extraction time, e.g. 2021-08-12, 2021-08-11, ...
- each subfolder contains:
  - 2 sub-subfolders, each containing JSON files that I need to query
  - other JSON files that I also need to query

Code produced so far:
s3 = boto3.resource("s3")
# Get bucket
bucket_name = "countries"
bucket = s3.Bucket(name=bucket_name)
path = "countries/"
folders = []
client = boto3.client('s3')
result = client.list_objects(Bucket=bucket_name, Prefix=path, Delimiter='/')
for o in result.get('CommonPrefixes'):
    folders.append(o.get('Prefix'))
for i in folders:
    sub = client.list_objects(Bucket=bucket_name, Prefix=folders[i], Delimiter='/')
Currently I am failing at the 2nd step, where I pass Prefix=folders[i]. Indexing directly with folders[0] returns the content of one subfolder from step 1, but when I try to iterate I get this error: TypeError: list indices must be integers or slices, not str
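The error can be reproduced without S3 at all: `for i in folders` binds each *element* (a prefix string) to `i`, so `folders[i]` tries to subscript the list with a string. A minimal sketch with stand-in prefixes (not real bucket data):

```python
folders = ["countries/2021-08-11/", "countries/2021-08-12/"]

# The bug: the loop variable is already a string element, not an index,
# so using it to subscript the list raises TypeError.
for i in folders:
    print(type(i))  # <class 'str'>

# folders[folders[0]]  ->  TypeError: list indices must be integers or slices, not str

# The fix: pass the loop variable itself as the prefix, e.g.
# sub = client.list_objects(Bucket=bucket_name, Prefix=i, Delimiter='/')
```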
I was able to solve it as follows:

~~~copied from question~~~
s3 = boto3.resource("s3")
# Get bucket
bucket_name = "countries"
bucket = s3.Bucket(name=bucket_name)
path = "countries/"
client = boto3.client('s3')  # the client from the question is needed below too
~~~~~~
# Create a reusable Paginator
paginator = client.get_paginator('list_objects_v2')

# create a list to store the keys of .json files
json_files = []

# Search through each folder that was returned via path
def path_search(path_folders):
    for folder in path_folders:
        folder_search(folder)

def folder_search(folder):
    # get subfolders in folder
    list_objects_in_folder = client.list_objects(Bucket=bucket_name, Prefix=folder, Delimiter='/')

    # ---- Looking for .json files ----
    # Create a PageIterator from the Paginator
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=folder, Delimiter='/')
    # Filter results for .json files only. Can use JMESPath for other searches
    filtered_iterator = page_iterator.search("Contents[?ends_with(Key, '.json')][]")
    for key_data in filtered_iterator:
        if key_data is not None:
            json_files.append(key_data['Key'])
    # ---------------------------------

    # get the prefix of each subfolder and recurse into it;
    # "CommonPrefixes" is absent when the folder has no subfolders
    for item in list_objects_in_folder.get("CommonPrefixes") or []:
        filepath = item.get("Prefix")
        # start the loop over again
        folder_search(filepath)
This code will iterate through every folder in a given path, look for .json files and subfolders, enter every subfolder, and repeat.
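Because S3 keys are flat and "folders" are just key prefixes, the recursion is optional: calling the paginator *without* Delimiter yields every object under the prefix, however deeply nested, in one pass. A minimal sketch of that alternative; the helper name `collect_json_keys` is my own, and the boto3 calls (commented out) assume the same `bucket_name` and `client` as above:

```python
def collect_json_keys(pages):
    """Collect keys ending in .json from list_objects_v2 result pages."""
    keys = []
    for page in pages:
        # "Contents" is absent on empty pages
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                keys.append(obj["Key"])
    return keys

# Without Delimiter, pagination descends the whole prefix "tree":
# client = boto3.client("s3")
# paginator = client.get_paginator("list_objects_v2")
# json_files = collect_json_keys(
#     paginator.paginate(Bucket="countries", Prefix="countries/")
# )
```

This trades the delimiter-based tree walk for a single flat listing, which usually means both simpler code and fewer API calls.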