
Is it required to run AWS Glue crawler to detect new data before executing an ETL job?

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating/updating databases and their corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know that there are no schema/partitioning changes.

So, is it required to run a crawler prior to running an ETL job in order to pick up new data?

AWS Glue will automatically detect new data in S3 buckets as long as it lands within your existing folders (partitions).
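
For illustration, here's a minimal PySpark sketch of a Glue job reading through the Data Catalog; the database and table names ("mydb", "mytable") are placeholders. Because the read resolves the table's partition locations at run time, new objects under already-registered partition paths are picked up without re-crawling:

    # Minimal Glue ETL sketch; "mydb" and "mytable" are placeholder names.
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # The DynamicFrame lists the table's S3 locations at read time,
    # so new files under existing partition paths are included.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="mydb",       # placeholder database name
        table_name="mytable",  # placeholder table name
    )
    print("Record count:", dyf.count())

    job.commit()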

If data is added to new folders (partitions), you need to reload your partitions by running MSCK REPAIR TABLE mytable; (e.g., in Athena).
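
If you want to trigger that repair programmatically before a job run, one option is to submit it through the Athena API. A minimal boto3 sketch, assuming placeholder names for the database, table, and query-results bucket:

    # Sketch: run MSCK REPAIR TABLE via the Athena API with boto3.
    # "mydb", "mytable", and the S3 output location are placeholders.
    import boto3

    athena = boto3.client("athena")

    response = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable;",
        QueryExecutionContext={"Database": "mydb"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print("Query execution id:", response["QueryExecutionId"])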

It's necessary to run the crawler prior to the job.

The crawler replaces Athena's MSCK REPAIR TABLE and also updates the table with new columns as they're added.
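
To orchestrate that ordering in code, you can start the crawler and poll its state before launching the job. A minimal boto3 sketch, assuming placeholder crawler and job names (in practice a Glue trigger or workflow can chain these for you):

    # Sketch: start a crawler, wait for it to finish, then run the ETL job.
    # "my-crawler" and "my-etl-job" are placeholder names.
    import time
    import boto3

    glue = boto3.client("glue")

    glue.start_crawler(Name="my-crawler")

    # Poll until the crawler returns to the READY state.
    while glue.get_crawler(Name="my-crawler")["Crawler"]["State"] != "READY":
        time.sleep(30)

    # The catalog is now up to date, so kick off the ETL job.
    glue.start_job_run(JobName="my-etl-job")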
