How to get ONLY bottom-level sub-folders from Amazon S3 with aioboto3, fast and asynchronously
I have asked a similar question before (how to get all subdirectories, at any depth, from AWS S3 with python boto3, excluding files), and others have asked related questions, but this one is more specific. I can get all subfolders at any arbitrary depth from S3 using the boto3 client (or aioboto3 for async code), but it is very slow: it brings back every object and then I filter with code like this:
subfolders = set()
prefix_tasks = [get_subfolders(bucket, prefix) for prefix in prefixes]
try:
    for prefix_future in asyncio.as_completed(prefix_tasks):
        prefix_subfolders = await prefix_future
        subfolders.update(prefix_subfolders)
except KeyError as exc:
    print(f"Scanning origin bucket failed due to: {exc}")
    raise exc
where my get_subfolders function is:
async def get_subfolders(self, bucket: str, prefix: str) -> Set[str]:
    subfolders = set()
    result = await self.s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = result.get("Contents")
    subfolders.update(await self._get_paths_by_depth(objects=objects, depth=4))

    # Use the continuation token to paginate through truncated results.
    while result["IsTruncated"]:
        result = await self.s3_client.list_objects_v2(
            Bucket=bucket,
            Prefix=prefix,
            ContinuationToken=result["NextContinuationToken"],
        )
        objects = result.get("Contents")
        subfolders.update(await self._get_paths_by_depth(objects=objects, depth=4))
    return subfolders
and my _get_paths_by_depth function is:
async def _get_paths_by_depth(self, objects: List[dict], depth: int) -> Set[str]:
    subfolders = set()
    current_path = None
    try:
        # Keep only paths whose depth is exactly `depth` levels.
        for bucket_object in objects:
            current_path = os.path.dirname(bucket_object["Key"])
            if current_path.count("/") == depth:
                subfolders.add(current_path)
    except Exception as exc:
        print(f"Getting subfolders failed due to error: {exc}")
        raise exc
    return subfolders
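To see what this exact-depth filter actually does, here is a small self-contained sketch (plain Python, no AWS calls) that applies the same os.path.dirname / slash-count logic to sample keys; the helper name paths_at_depth and the standalone form are mine, not from my class above:

```python
import os
from typing import Iterable, Set


def paths_at_depth(keys: Iterable[str], depth: int) -> Set[str]:
    """Keep only parent folders whose path contains exactly `depth` slashes."""
    folders = set()
    for key in keys:
        folder = os.path.dirname(key)  # strip the file name (or trailing segment)
        if folder.count("/") == depth:
            folders.add(folder)
    return folders


keys = [
    "prefix/subfolder1/subfolder2/subfolder3/file1.txt",
    "prefix/subfolder1/subfolder2/subfolder3/file2.json",
    "prefix/subfolder4/subfolder5/file3.json",
    "prefix/subfolder6/subfolder7/subfolder8/",  # zero-byte "folder" marker, no files
]
```

Note that a single exact depth only catches folders at that one level (depth=3 here finds the subfolder3 and subfolder8 paths but misses subfolder5 at depth 2), so a fixed depth value cannot capture a bucket whose folders sit at mixed depths.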
Is there any way to speed this up? I really want to avoid bringing back every file and then filtering out the paths. Can I ask directly for paths of a specific depth?

My file structure looks like this:
prefix/subfolder1/subfolder2/subfolder3/file1.txt
prefix/subfolder1/subfolder2/subfolder3/file2.json
prefix/subfolder4/subfolder5/file3.json
prefix/subfolder6/subfolder7/subfolder8/
and I only want the paths that end in at least one file. In the case above, I want to end up with:
prefix/subfolder1/subfolder2/subfolder3/
prefix/subfolder4/subfolder5/
So far, the code I posted in the question inspects every single file in the bucket and keeps its path in a set. That works, but it takes far too long.
A much faster approach is to use the Delimiter parameter in the S3 request. Specifically, I used "." as the delimiter, which changes the s3_client response: it now includes every CommonPrefix in the bucket that contains a ".". Since every file name contains a ".", I get all the common prefixes from the listing pages instead of inspecting each file individually. The new code looks like this:
async def get_subfolders(
    self, bucket: str, prefix: str, delimiter: str = "."
) -> Set[str]:
    subfolders = set()
    foldername = None
    try:
        paginator = self.s3_client.get_paginator("list_objects")
        async for result in paginator.paginate(
            Bucket=bucket, Prefix=prefix, Delimiter=delimiter
        ):
            for obj in result.get("CommonPrefixes", []):
                foldername = os.path.dirname(obj["Prefix"])
                # Keep only paths at least S3_FOLDERS_PATH_DEPTH levels deep.
                if foldername.count("/") >= S3_FOLDERS_PATH_DEPTH:
                    subfolders.add(foldername)
    except Exception as exc:
        print(f"Getting subfolders failed due to error: {exc}")
        raise exc
    return subfolders
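The filtering inside that loop can be exercised without AWS by feeding it the CommonPrefixes entries S3 would return for the example bucket with Delimiter="." (each prefix is truncated at the first "." after the requested Prefix). The helper name and the hand-built sample data below are mine, a sketch rather than a capture of a real response:

```python
import os
from typing import List, Set


def folders_from_common_prefixes(common_prefixes: List[dict], min_depth: int) -> Set[str]:
    """Extract parent folders from CommonPrefixes entries, keeping those at least min_depth deep."""
    subfolders = set()
    for obj in common_prefixes:
        foldername = os.path.dirname(obj["Prefix"])
        if foldername.count("/") >= min_depth:
            subfolders.add(foldername)
    return subfolders


# Roughly what list_objects would report as CommonPrefixes for the example keys
# with Delimiter="." (each entry cut at the first "." after the prefix):
common_prefixes = [
    {"Prefix": "prefix/subfolder1/subfolder2/subfolder3/file1."},
    {"Prefix": "prefix/subfolder1/subfolder2/subfolder3/file2."},
    {"Prefix": "prefix/subfolder4/subfolder5/file3."},
    # "prefix/subfolder6/subfolder7/subfolder8/" contains no ".", so it never
    # shows up here -- which is exactly why empty folders drop out of the result.
]
```

With min_depth=2 this yields exactly the two desired folders from the question, and the empty subfolder8 path is excluded for free.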