
Force Glue Crawler to create separate tables

I am continuously adding parquet data sets to an S3 folder with a structure like this:

s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3

At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions. I see why this happens, as explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2) I don't want it to become another partition (because it is completely different data with a different schema). How can I force the Glue crawler to NOT create partitions? I know I could define the crawler path as s3:::my-bucket/public/data/ but unfortunately I don't know where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).

Any ideas how to solve this?

You can use TableLevelConfiguration to specify at which folder level the crawler should look for tables.

More information on that is in the AWS Glue documentation on setting crawler configuration options.
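
A minimal boto3 sketch of what that could look like, assuming a crawler named my-crawler (hypothetical) already exists. TableLevelConfiguration goes into the crawler's Configuration JSON under Grouping, and the level counts folders down from the bucket (bucket = level 1), so set1 in the layout above would sit at level 4:

import json

import boto3

glue = boto3.client("glue")

# Pin table creation to folder level 4, so the crawler creates a table
# "set1" for s3://my-bucket/public/data/set1 instead of treating the
# intermediate folders as partitions.
glue.update_crawler(
    Name="my-crawler",  # hypothetical crawler name
    Configuration=json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": 4}}
    ),
)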

My solution was to manually add the specific paths to the Glue crawler. The big picture is that I am using a Glue job to transform data from one S3 bucket and write it to another one. I ended up initially configuring the Glue crawler to crawl the whole bucket. But every time the Glue transformation job runs, it also updates the Glue crawler: it removes the initial full-bucket location (if it still exists) and then adds the new path to the S3 targets.

In Python it looks something like this:

import logging
import time

import boto3

# The snippet assumes glue_client plus the helpers get_crawler,
# prefix_exists and transform are defined at module level.
glue_client = boto3.client("glue")


def update_target_paths(crawler):
    """
    Remove initial include path (whole bucket) from paths and
    add folder for current files to include paths.
    """

    def path_is(c, p):
        return c["Path"] == p

    # get S3 targets and remove initial bucket target
    s3_targets = list(
        filter(
            lambda c: not path_is(c, f"s3://{bucket_name}"),
            crawler["Targets"]["S3Targets"],
        )
    )
    # add new target path if not in targets yet
    if not any(filter(lambda c: path_is(c, output_loc), s3_targets)):
        s3_targets.append({"Path": output_loc})
        logging.info("Appending path '%s' to Glue crawler include path.", output_loc)
    crawler["Targets"]["S3Targets"] = s3_targets
    return crawler


def remove_excessive_keys(crawler):
    """Remove keys from Glue crawler dict that are not needed/allowed to update the crawler"""
    for k in ["State", "CrawlElapsedTime", "CreationTime", "LastUpdated", "LastCrawl", "Version"]:
        try:
            del crawler[k]
        except KeyError:
            logging.warning(f"Key '{k}' not in crawler result dictionary.")
    return crawler


if __name__ == "__main__":
    logging.info(f"Transforming from {input_loc} to {output_loc}.")
    if prefix_exists(curated_zone_bucket_name, curated_zone_key):
        logging.info("Target object already exists, appending.")
    else:
        logging.info("Target object doesn't exist, writing to new one.")
    transform() # do data transformation and write to output bucket
    while True:
        try:
            crawler = get_crawler(CRAWLER_NAME)
            crawler = update_target_paths(crawler)
            crawler = remove_excessive_keys(crawler)

            # Update Glue crawler with updated include paths
            glue_client.update_crawler(**crawler)

            glue_client.start_crawler(Name=CRAWLER_NAME)
            logging.info("Started Glue crawler '%s'.", CRAWLER_NAME)
            break
        except (
            glue_client.exceptions.CrawlerRunningException,
            glue_client.exceptions.InvalidInputException,
        ):
            logging.warning("Crawler still running...")
            time.sleep(10)

Variables defined globally: input_loc, output_loc, CRAWLER_NAME, bucket_name.
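
For reference, a plausible sketch of the get_crawler helper used above (an assumption; the answer doesn't show it). It would simply unwrap the boto3 response so the rest of the code can work on the inner crawler dict:

def get_crawler(name):
    # Assumed helper: boto3's get_crawler returns {"Crawler": {...}};
    # the update logic above operates on the inner dict.
    return glue_client.get_crawler(Name=name)["Crawler"]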

For every new data set a new path is added to the Glue crawler. No partitions will be created.
