List all S3 files parsed by AWS Glue from a table using the AWS Python SDK boto3
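I tried to find a way to do this in the Glue API documentation, but the functions get_table(**kwargs) and get_tables(**kwargs) have no attributes or methods related to it.
I imagine something like the following (pseudo)code: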
import boto3

client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...
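As far as I can tell from the documentation, each table in response["TableList"] should be a dictionary; however, none of its keys seems to give access to the files stored in it.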
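For reference, each Glue table dictionary does record where its data lives, under StorageDescriptor -> Location, and the objects under that S3 prefix can then be listed with the S3 list_objects_v2 paginator. The following is only a minimal sketch under some assumptions: the table is backed by S3 and has a Location set, and "the files of the table" is taken to mean every object currently under that prefix, which is not necessarily what a Glue crawler or job actually read. The helper names split_s3_uri and iter_table_files are hypothetical, and the clients are passed in as arguments so the function can be exercised without AWS access.

```python
from typing import Iterator, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a", "s3n") or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def iter_table_files(glue_client, s3_client,
                     database: str, table: str) -> Iterator[str]:
    """Yield the S3 paths of all objects under a Glue table's location.

    glue_client and s3_client are expected to behave like
    boto3.client('glue') and boto3.client('s3').
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    # Assumption: the table is S3-backed, so a Location is present.
    location = response["Table"]["StorageDescriptor"]["Location"]
    bucket, prefix = split_s3_uri(location)

    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects carry no 'Contents' key.
        for obj in page.get("Contents", []):
            yield f"s3://{bucket}/{obj['Key']}"
```

With real credentials this would be called as iter_table_files(boto3.client('glue'), boto3.client('s3'), db_name, table_name). Note that a Location without a trailing slash is used as a plain prefix, so it can also match sibling prefixes that start with the same characters.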
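The solution to this problem is to use awswrangler.
The following function checks every AWS Glue table in a database against a given list of recently uploaded files. Whenever a file path matches, it yields the associated table dictionary. The tables it yields are therefore the ones that were recently updated.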
from typing import Iterator, List

import awswrangler
import boto3


def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Iterator[dict]:
    """Check which tables have been updated recently.

    Args:
        upload_path_list (List[str]): contains all S3-filepaths of recently uploaded files
        db_name (str): name of the AWS Glue database

    Yields:
        dict: AWS Glue table dictionaries recently updated
    """
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for response in paginator.paginate(DatabaseName=db_name):
        for table_dict in response['TableList']:
            table_name = table_dict['Name']
            # Look up the S3 location registered for this table in the catalog
            s3_bucket_path = awswrangler.catalog.get_table_location(
                database=db_name, table=table_name)
            # describe_objects returns a dict keyed by full S3 object path
            s3_filepaths = set(
                awswrangler.s3.describe_objects(s3_bucket_path).keys())
            # The table counts as updated if any uploaded file is one of its objects
            if any(upload_file in s3_filepaths
                   for upload_file in upload_path_list):
                yield table_dict