
List all S3 files parsed by AWS Glue from a table using the AWS Python SDK boto3

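I searched the Glue API documentation for a way to do this, but neither get_table(**kwargs) nor get_tables(**kwargs) seems to return an attribute or offer a method that lists the files.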

I imagine something like the following (pseudo)code:

import boto3

client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
# db_input_shared holds the name of the Glue database
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...

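As far as I can tell from the documentation, each table in response["TableList"] should be a dictionary; however, none of its keys seems to give access to the files stored in the table.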
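For reference, each table dictionary does carry the S3 location of its data under StorageDescriptor -> Location, even though it carries no file list. A minimal sketch of how the files could be enumerated from that location with plain boto3 follows. The helper names split_s3_uri and iter_table_files are hypothetical, the S3 client is passed in by the caller, and the result is simply every object currently under the table's location, which is an assumption about what "files parsed by Glue" means here.

```python
from typing import Any, Dict, Iterator, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def iter_table_files(table: Dict[str, Any], s3_client: Any) -> Iterator[str]:
    """Yield the S3 URI of every object under a Glue table's location.

    `table` is one entry of response['TableList'];
    `s3_client` is expected to be boto3.client('s3').
    """
    location = table.get("StorageDescriptor", {}).get("Location")
    if not location:
        return  # some tables (e.g. views) have no storage location
    bucket, prefix = split_s3_uri(location)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page holds no objects
        for obj in page.get("Contents", []):
            yield f"s3://{bucket}/{obj['Key']}"
```

Inside the loop above this would be called as iter_table_files(table, boto3.client('s3')). One caveat: if the location has no trailing slash, the prefix also matches sibling prefixes that start with the same characters.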

The solution to this problem is to use awswrangler.

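The following function checks every AWS Glue table in a database against a given list of recently uploaded files. Whenever one of those file paths matches a file stored under a table's location, the function yields the associated table dictionary. The tables it yields are therefore the ones that were recently updated.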

from typing import Iterator, List

import awswrangler
import boto3


def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Iterator[dict]:
    """Check which tables have been updated recently.

    Args:
        upload_path_list (List[str]): S3 file paths of all recently uploaded files
        db_name (str): name of the AWS Glue database

    Yields:
        dict: AWS Glue table dictionary of each recently updated table
    """
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for response in paginator.paginate(DatabaseName=db_name):
        for table_dict in response['TableList']:
            table_name = table_dict['Name']
            # Look up the S3 location registered for this table in the catalog.
            s3_bucket_path = awswrangler.catalog.get_table_location(
                database=db_name, table=table_name)
            # describe_objects returns a dict keyed by S3 object path.
            s3_filepaths = set(
                awswrangler.s3.describe_objects(s3_bucket_path))
            # Yield the table if any uploaded file is stored under its location.
            if any(upload_file in s3_filepaths
                   for upload_file in upload_path_list):
                yield table_dict
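
The matching step in the function above can also be looked at in isolation, which makes it easy to try without AWS access. The sketch below is only an illustration: the function name tables_touched_by_uploads and the input mapping files_by_table (table name to the S3 paths under its location) are hypothetical and not part of awswrangler or boto3.

```python
from typing import Dict, Iterable, List


def tables_touched_by_uploads(
    files_by_table: Dict[str, Iterable[str]],
    upload_paths: Iterable[str],
) -> List[str]:
    """Return the names of tables that contain at least one uploaded file."""
    uploaded = set(upload_paths)
    return [
        name
        for name, files in files_by_table.items()
        # isdisjoint is False as soon as one path is shared
        if not uploaded.isdisjoint(files)
    ]


# Made-up example data, not taken from a real bucket
files_by_table = {
    "orders": ["s3://b/orders/1.csv", "s3://b/orders/2.csv"],
    "users": ["s3://b/users/1.csv"],
}
print(tables_touched_by_uploads(files_by_table, ["s3://b/orders/2.csv"]))
# prints ['orders']
```

The comparison is an exact string match, so the upload paths have to use the same s3://bucket/key form as the listed object paths.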
