List all S3 files parsed by AWS Glue from a table using the AWS Python SDK boto3

I tried to find a way through the Glue API docs, but neither get_table(**kwargs) nor get_tables(**kwargs) seems to offer an attribute or method for this.

I imagine something akin to the following (pseudo-)code:

import boto3

client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...

As far as I can see from the docs, each table in response["TableList"] should be a dictionary; yet none of its keys seem to give access to the files stored in it.

The solution to the problem was to use awswrangler.

The following function checks all AWS Glue tables within a database against a given list of recently uploaded files. Whenever one of those file paths is found under a table's S3 location, it yields the associated table dictionary. The yielded tables are therefore the ones that have been recently updated.

from typing import Iterator, List

import awswrangler
import boto3


def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Iterator[dict]:
    """Check which tables have been updated recently.

    Args:
        upload_path_list (List[str]): contains all S3-filepaths of recently uploaded files
        db_name (str): name of the AWS Glue database

    Yields:
        dict: AWS Glue table dictionaries recently updated
    """
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for response in paginator.paginate(DatabaseName=db_name):
        for table_dict in response['TableList']:
            table_name = table_dict['Name']
            # S3 location (prefix) that the Glue table points to
            s3_bucket_path = awswrangler.catalog.get_table_location(
                database=db_name, table=table_name)
            # Full S3 paths of all objects stored under that location
            s3_filepaths = set(
                awswrangler.s3.describe_objects(s3_bucket_path).keys())
            # Yield the table if any uploaded file lives under its location
            if any(upload_file in s3_filepaths
                   for upload_file in upload_path_list):
                yield table_dict
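
For readers who would rather not add a dependency, the same idea can probably be expressed with boto3 alone. Each table dictionary returned by get_tables carries the table's S3 location under table["StorageDescriptor"]["Location"], so the files can be listed with the S3 list_objects_v2 paginator. The following is only a minimal sketch under that assumption: the helper names split_s3_uri, list_s3_uris and tables_containing_uploads are hypothetical (they are not part of boto3 or awswrangler), and the listing function is passed in as a parameter so the matching logic can be tried without AWS access.

```python
from typing import Callable, Iterable, Iterator, List, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def list_s3_uris(location: str) -> List[str]:
    """List full s3:// URIs of all objects under a table location (hypothetical helper)."""
    import boto3  # imported lazily so the pure helpers work without AWS access

    bucket, prefix = split_s3_uri(location)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    uris = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects
        for obj in page.get("Contents", []):
            uris.append(f"s3://{bucket}/{obj['Key']}")
    return uris


def tables_containing_uploads(
    tables: Iterable[dict],
    uploaded_paths: Iterable[str],
    list_objects: Callable[[str], Iterable[str]] = list_s3_uris,
) -> Iterator[dict]:
    """Yield every Glue table dict whose location holds an uploaded file."""
    uploaded = set(uploaded_paths)
    for table in tables:
        # Tables without a storage location (e.g. some views) are skipped
        location = table.get("StorageDescriptor", {}).get("Location")
        if not location:
            continue
        if uploaded.intersection(list_objects(location)):
            yield table
```

In the paginator loop from the answer above, this would be called as tables_containing_uploads(response['TableList'], upload_path_list). Note that S3 prefixes are plain string prefixes, so a location without a trailing slash would also match sibling prefixes that share the same beginning.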
