Force Glue Crawler to create separate tables
I am continuously adding parquet data sets to an S3 folder with a structure like this:

s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3

At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.

I see why this happens, as it is explained under "How Does a Crawler Determine When to Create Partitions?". But when a new data set is uploaded (e.g. set2), I don't want it to become another partition, because it is completely different data with a different schema.

How can I force the Glue crawler to NOT create partitions?

I know I could define the crawler path as s3:::my-bucket/public/data/, but unfortunately I don't know in advance where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).

Any ideas how to solve this?
My solution was to manually add the specific paths to the Glue crawler. The big picture is that I am using a Glue job to transform data from one S3 bucket and write it to another one. I ended up initially configuring the Glue crawler to crawl the whole bucket. But every time the Glue transformation job runs, it also updates the Glue crawler: it removes the initial full-bucket location (if it still exists) and then adds the new path to the S3 targets.

In Python it looks something like this:
import logging
import time

# glue_client, get_crawler, prefix_exists, transform and the *_loc/bucket
# variables are defined at module level (see the note on global variables below)


def update_target_paths(crawler):
    """
    Remove initial include path (whole bucket) from paths and
    add folder for current files to include paths.
    """

    def path_is(c, p):
        return c["Path"] == p

    # get S3 targets and remove initial bucket target
    s3_targets = list(
        filter(
            lambda c: not path_is(c, f"s3://{bucket_name}"),
            crawler["Targets"]["S3Targets"],
        )
    )
    # add new target path if not in targets yet
    if not any(filter(lambda c: path_is(c, output_loc), s3_targets)):
        s3_targets.append({"Path": output_loc})
        logging.info("Appending path '%s' to Glue crawler include path.", output_loc)
    crawler["Targets"]["S3Targets"] = s3_targets
    return crawler


def remove_excessive_keys(crawler):
    """Remove keys from the Glue crawler dict that are not needed/allowed to update the crawler."""
    for k in ["State", "CrawlElapsedTime", "CreationTime", "LastUpdated", "LastCrawl", "Version"]:
        try:
            del crawler[k]
        except KeyError:
            logging.warning(f"Key '{k}' not in crawler result dictionary.")
    return crawler


if __name__ == "__main__":
    logging.info(f"Transforming from {input_loc} to {output_loc}.")

    if prefix_exists(curated_zone_bucket_name, curated_zone_key):
        logging.info("Target object already exists, appending.")
    else:
        logging.info("Target object doesn't exist, writing to new one.")

    transform()  # do data transformation and write to output bucket

    while True:
        try:
            crawler = get_crawler(CRAWLER_NAME)
            crawler = update_target_paths(crawler)
            crawler = remove_excessive_keys(crawler)
            # Update Glue crawler with updated include paths
            glue_client.update_crawler(**crawler)
            glue_client.start_crawler(Name=CRAWLER_NAME)
            logging.info("Started Glue crawler '%s'.", CRAWLER_NAME)
            break
        except (
            glue_client.exceptions.CrawlerRunningException,
            glue_client.exceptions.InvalidInputException,
        ):
            logging.warning("Crawler still running...")
            time.sleep(10)
Variables defined globally: input_loc, output_loc, CRAWLER_NAME, bucket_name.
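
The script also calls two small helpers, get_crawler and prefix_exists, that are not shown above. A minimal sketch of how they could look, assuming module-level boto3 clients (the client setup and this exact implementation are assumptions, not part of the original script):

import boto3

# assumed module-level clients
glue_client = boto3.client("glue")
s3_client = boto3.client("s3")


def get_crawler(name):
    # GetCrawler returns {"Crawler": {...}}; update_crawler takes the inner
    # dict's fields as keyword arguments, so unwrap it here
    return glue_client.get_crawler(Name=name)["Crawler"]


def prefix_exists(bucket, prefix):
    # True if at least one object exists under the given prefix
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0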
For every new data set a new path is added to the Glue crawler. No partitions will be created.
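
To verify, you can list the tables in the Glue Data Catalog after the crawler has finished; each include path should show up as its own table without partition keys. A quick check, assuming the catalog database is called my_database (a placeholder name, not from the setup above):

import boto3

glue_client = boto3.client("glue")

# list tables and their partition keys (my_database is a placeholder)
paginator = glue_client.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="my_database"):
    for table in page["TableList"]:
        print(table["Name"], table.get("PartitionKeys", []))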