Force Glue Crawler to create separate tables

Question

I keep adding Parquet datasets to an S3 folder structured as follows:

s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3

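At first I only had set1, and my crawler was configured to run on the whole bucket s3:::my-bucket. This resulted in a partitioned table named my-bucket, with partitions named public, data and set1. What I actually want is a table named set1 without any partitions. I understand why this happens, as explained under "How does a crawler determine when to create partitions?". But when a new dataset is uploaded (e.g. set2), I don't want it to become another partition, because it is completely different data with a different schema. How can I force a Glue crawler to NOT create partitions? I know I could define the crawler path as s3:::my-bucket/public/data/, but unfortunately I don't know where new datasets will be created (it could just as well be s3:::my-bucket/other/folder/set2).

Any ideas on how to solve this?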

Answer 1

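You can use TableLevelConfiguration to specify at which folder level the crawler should look for tables.

More information is available in the AWS Glue documentation on crawler configuration options.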
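As a rough illustration of this answer, the sketch below builds the JSON string that a crawler's Configuration field would carry for a given table level. The helper names table_level_configuration and table_level_for are hypothetical, and counting the bucket as level 1 is an assumption that should be checked against the Glue documentation before relying on it.

```python
import json


def table_level_configuration(level: int) -> str:
    """Return a crawler Configuration JSON string that pins the table level."""
    if level < 1:
        raise ValueError("table level must be a positive integer")
    return json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": level}}
    )


def table_level_for(s3_path: str) -> int:
    """Count path components, treating the bucket as level 1 (assumption)."""
    without_scheme = s3_path.split("://", 1)[-1]
    return len([part for part in without_scheme.split("/") if part])


# For the layout in the question, set1 sits four components deep.
level = table_level_for("s3://my-bucket/public/data/set1")
configuration = table_level_configuration(level)
print(level, configuration)

# With boto3, the string could then be passed along the lines of:
#   glue_client.update_crawler(Name="my-crawler", Configuration=configuration)
# ("my-crawler" is a placeholder name.)
```

Note that a single table level only helps if every dataset sits at the same depth below the bucket, as both example paths in the question do.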

Answer 2

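My solution was to manually add the specific paths to the Glue crawler. The big picture is that I use a Glue job to transform data from one S3 bucket and write it to another. I initially configure the Glue crawler to crawl the whole bucket. But every time the Glue transformation job runs, it also updates the Glue crawler: it removes the initial whole-bucket location (if it is still there) and then adds the new path to the S3 targets.

In Python it looks something like this: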

# Imports and client the snippet relies on (not shown in the original answer).
import logging
import time

import boto3

glue_client = boto3.client("glue")


def update_target_paths(crawler):
    """
    Remove initial include path (whole bucket) from paths and
    add folder for current files to include paths.
    """

    def path_is(c, p):
        return c["Path"] == p

    # get S3 targets and remove initial bucket target
    s3_targets = [
        c
        for c in crawler["Targets"]["S3Targets"]
        if not path_is(c, f"s3://{bucket_name}")
    ]
    # add new target path if not in targets yet
    if not any(path_is(c, output_loc) for c in s3_targets):
        s3_targets.append({"Path": output_loc})
        logging.info("Appending path '%s' to Glue crawler include path.", output_loc)
    crawler["Targets"]["S3Targets"] = s3_targets
    return crawler


def remove_excessive_keys(crawler):
    """Remove keys from Glue crawler dict that are not needed/allowed to update the crawler"""
    # These fields are returned when reading a crawler but rejected by update_crawler.
    for k in ["State", "CrawlElapsedTime", "CreationTime", "LastUpdated", "LastCrawl", "Version"]:
        try:
            del crawler[k]
        except KeyError:
            logging.warning("Key '%s' not in crawler result dictionary.", k)
    return crawler


if __name__ == "__main__":
    # prefix_exists, transform and get_crawler are helpers defined elsewhere
    # in the job; they are not shown in this answer.
    logging.info("Transforming from %s to %s.", input_loc, output_loc)
    if prefix_exists(curated_zone_bucket_name, curated_zone_key):
        logging.info("Target object already exists, appending.")
    else:
        logging.info("Target object doesn't exist, writing to new one.")
    transform()  # do data transformation and write to output bucket
    # Retry until the crawler is idle and can be updated and started.
    while True:
        try:
            crawler = get_crawler(CRAWLER_NAME)
            crawler = update_target_paths(crawler)
            crawler = remove_excessive_keys(crawler)

            # Update Glue crawler with updated include paths
            glue_client.update_crawler(**crawler)

            glue_client.start_crawler(Name=CRAWLER_NAME)
            logging.info("Started Glue crawler '%s'.", CRAWLER_NAME)
            break
        except (
            glue_client.exceptions.CrawlerRunningException,
            glue_client.exceptions.InvalidInputException,
        ):
            # Both exceptions are treated as "crawler busy" and retried.
            logging.warning("Crawler still running...")
            time.sleep(10)

Globally defined variables: input_loc, output_loc, CRAWLER_NAME, bucket_name.

For each new dataset, a new path is added to the Glue crawler. No partitions are created.
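The get_crawler helper used above is not shown in the answer. A minimal sketch of what it might look like follows, assuming it simply wraps the boto3 Glue client's get_crawler call and returns the "Crawler" part of the response. Unlike the one-argument call in the answer, this version takes the client as a parameter so it can be exercised offline; FakeGlueClient is a hypothetical stand-in for boto3.client("glue").

```python
def get_crawler(glue_client, name):
    """Fetch a crawler definition as a plain dict (hypothetical helper)."""
    response = glue_client.get_crawler(Name=name)
    return response["Crawler"]


class FakeGlueClient:
    """Stand-in for boto3.client("glue") so the sketch runs without AWS access."""

    def get_crawler(self, Name):
        # Mimics the shape of the real response: the definition sits under "Crawler".
        return {
            "Crawler": {
                "Name": Name,
                "Targets": {"S3Targets": [{"Path": "s3://my-bucket"}]},
                "State": "READY",
            }
        }


crawler = get_crawler(FakeGlueClient(), "my-crawler")
print(crawler["Targets"]["S3Targets"])
```

The returned dict has the shape that update_target_paths and remove_excessive_keys operate on, which is why the read-only keys such as State have to be stripped before it is passed back to update_crawler.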

Answer 1: 2 votes, 2022-03-14 10:27:26

Answer 2: 1 vote, accepted, 2022-03-17 09:08:11
