
AWS Glue Crawler is not updating the table after the 1st crawl

I am adding a new file in Parquet format, created by a Glue DataBrew job, to my S3 folder. The new file has the same schema as the previous file. But when I run the crawler a second time, it neither updates the table nor creates a new one in the data catalog. However, when I crawl both files together, both of them get added.

The log file gives the following information:

INFO: Created partitions with values [[New file name]] for table
BENCHMARK: Finished writing to Catalog
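
The log suggests the crawler registered the new file as a partition of the existing table rather than creating or rewriting a table. One way to confirm what actually landed in the catalog is to inspect the table and its partitions with boto3; this is just a diagnostic sketch, and the database/table names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names -- replace with your own database and table.
DATABASE = "my_database"
TABLE = "my_table"

# Show the schema the crawler recorded for the table.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)
for col in table["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])

# List the partitions; if the crawler logged "Created partitions",
# the new file should appear here rather than as a separate table.
partitions = glue.get_partitions(DatabaseName=DATABASE, TableName=TABLE)
for p in partitions["Partitions"]:
    print(p["Values"], p["StorageDescriptor"]["Location"])
```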

I have tried with and without "Create a single schema for each S3 path", but the crawler is not updating the table with the new file. Soon I will be adding new files on a daily basis for my analysis. Any solution?
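
For reference, the "Create a single schema for each S3 path" console option corresponds to the crawler's grouping configuration, which can also be toggled through the API. A minimal sketch with boto3, assuming a crawler named my-crawler (a placeholder):

```python
import json
import boto3

glue = boto3.client("glue")

# "Create a single schema for each S3 path" in the console maps to the
# CombineCompatibleSchemas grouping policy in the crawler configuration.
glue.update_crawler(
    Name="my-crawler",  # placeholder crawler name
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)
```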

In my opinion, the best way to approach this issue is to have AWS DataBrew output to the Data Catalog directly. The Data Catalog can be updated either by the crawler or by DataBrew directly, but the recommended practice is to use only one of those mechanisms, not both.

Can you try running the job with the Data Catalog as its output and letting DataBrew manage your catalog? It should update your catalog table with the right data/files.
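
As a rough sketch of that setup, a DataBrew recipe job can write directly to a Glue Data Catalog table through the DataCatalogOutputs parameter of CreateRecipeJob. Every name below (job, dataset, recipe, role, database, table, bucket) is a placeholder:

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job that writes its result straight into the
# Data Catalog, so DataBrew keeps the table up to date itself.
databrew.create_recipe_job(
    Name="my-databrew-job",                # placeholder job name
    DatasetName="my-dataset",              # placeholder dataset
    RecipeReference={"Name": "my-recipe"}, # placeholder recipe
    RoleArn="arn:aws:iam::123456789012:role/MyDataBrewRole",  # placeholder role
    DataCatalogOutputs=[
        {
            "DatabaseName": "my_database",  # placeholder database
            "TableName": "my_table",        # placeholder table
            "S3Options": {
                "Location": {"Bucket": "my-bucket", "Key": "databrew-output/"}
            },
            "Overwrite": False,  # keep existing data instead of replacing it
        }
    ],
)
```

With Overwrite left as False, each run should add its output alongside the existing data rather than replacing it, which fits the daily-file workflow described in the question.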
