
Force Glue Crawler to create separate tables

I am continuously adding parquet data sets to an S3 folder with a structure like this:

s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3

At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions. I see why this happens, as explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2) I don't want it to become another partition (because it is completely different data with a different schema). How can I force the Glue crawler to NOT create partitions? I know I could define the crawler path as s3:::my-bucket/public/data/ but unfortunately I don't know where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).

Any ideas how to solve this?

You can use TableLevelConfiguration to specify at which folder level the crawler should look for tables.

More information on that is in the AWS Glue documentation on setting crawler configuration options.
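
A minimal boto3 sketch of what that could look like, assuming a crawler named my-crawler (hypothetical) already exists. TableLevelConfiguration goes into the crawler's Configuration JSON under Grouping, and the level counts folders down from the bucket (bucket = level 1), so set1 in the layout above would sit at level 4:

import json

import boto3

glue = boto3.client("glue")

# Pin table creation to folder level 4, so the crawler creates a table
# "set1" for s3://my-bucket/public/data/set1 instead of treating the
# intermediate folders as partitions.
glue.update_crawler(
    Name="my-crawler",  # hypothetical crawler name
    Configuration=json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": 4}}
    ),
)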

My solution was to manually add the specific paths to the Glue crawler. The big picture is that I am using a Glue job to transform data from one S3 bucket and write it to another one. I ended up initially configuring the Glue crawler to crawl the whole bucket. But every time the Glue transformation job runs, it also updates the Glue crawler: it removes the initial full-bucket location (if it still exists) and then adds the new path to the S3 targets.

In Python it looks something like this:

import logging
import time

import boto3

# The snippet assumes glue_client plus the helpers get_crawler,
# prefix_exists and transform are defined at module level.
glue_client = boto3.client("glue")


def update_target_paths(crawler):
    """
    Remove initial include path (whole bucket) from paths and
    add folder for current files to include paths.
    """

    def path_is(c, p):
        return c["Path"] == p

    # get S3 targets and remove initial bucket target
    s3_targets = list(
        filter(
            lambda c: not path_is(c, f"s3://{bucket_name}"),
            crawler["Targets"]["S3Targets"],
        )
    )
    # add new target path if not in targets yet
    if not any(filter(lambda c: path_is(c, output_loc), s3_targets)):
        s3_targets.append({"Path": output_loc})
        logging.info("Appending path '%s' to Glue crawler include path.", output_loc)
    crawler["Targets"]["S3Targets"] = s3_targets
    return crawler


def remove_excessive_keys(crawler):
    """Remove keys from Glue crawler dict that are not needed/allowed to update the crawler"""
    for k in ["State", "CrawlElapsedTime", "CreationTime", "LastUpdated", "LastCrawl", "Version"]:
        try:
            del crawler[k]
        except KeyError:
            logging.warning(f"Key '{k}' not in crawler result dictionary.")
    return crawler


if __name__ == "__main__":
    logging.info(f"Transforming from {input_loc} to {output_loc}.")
    if prefix_exists(curated_zone_bucket_name, curated_zone_key):
        logging.info("Target object already exists, appending.")
    else:
        logging.info("Target object doesn't exist, writing to new one.")
    transform() # do data transformation and write to output bucket
    while True:
        try:
            crawler = get_crawler(CRAWLER_NAME)
            crawler = update_target_paths(crawler)
            crawler = remove_excessive_keys(crawler)

            # Update Glue crawler with updated include paths
            glue_client.update_crawler(**crawler)

            glue_client.start_crawler(Name=CRAWLER_NAME)
            logging.info("Started Glue crawler '%s'.", CRAWLER_NAME)
            break
        except (
            glue_client.exceptions.CrawlerRunningException,
            glue_client.exceptions.InvalidInputException,
        ):
            logging.warning("Crawler still running...")
            time.sleep(10)

Variables defined globally: input_loc, output_loc, CRAWLER_NAME, bucket_name.
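
For reference, a plausible sketch of the get_crawler helper used above (an assumption; the answer doesn't show it). It would simply unwrap the boto3 response so the rest of the code can work on the inner crawler dict:

def get_crawler(name):
    # Assumed helper: boto3's get_crawler returns {"Crawler": {...}};
    # the update logic above operates on the inner dict.
    return glue_client.get_crawler(Name=name)["Crawler"]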

For every new data set a new path is added to the Glue crawler. No partitions will be created.
