
Is it required to run AWS Glue crawler to detect new data before executing an ETL job?

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating/updating databases and their corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know that there are no schema/partitioning changes.

So, is it required to run a crawler prior to running an ETL job in order to pick up new data?

AWS Glue will automatically detect new data in S3 buckets as long as it lands within your existing folders (partitions).
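
For illustration, here's a minimal PySpark sketch of a Glue job reading through the Data Catalog; the database and table names ("mydb", "mytable") are placeholders. Because the read resolves the table's partition locations at run time, new objects under already-registered partition paths are picked up without re-crawling:

    # Minimal Glue ETL sketch; "mydb" and "mytable" are placeholder names.
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # The DynamicFrame lists the table's S3 locations at read time,
    # so new files under existing partition paths are included.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="mydb",       # placeholder database name
        table_name="mytable",  # placeholder table name
    )
    print("Record count:", dyf.count())

    job.commit()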

If data is added to new folders (partitions), you need to reload your partitions by running MSCK REPAIR TABLE mytable; (e.g., in Athena).
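
If you want to trigger that repair programmatically before a job run, one option is to submit it through the Athena API. A minimal boto3 sketch, assuming placeholder names for the database, table, and query-results bucket:

    # Sketch: run MSCK REPAIR TABLE via the Athena API with boto3.
    # "mydb", "mytable", and the S3 output location are placeholders.
    import boto3

    athena = boto3.client("athena")

    response = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable;",
        QueryExecutionContext={"Database": "mydb"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print("Query execution id:", response["QueryExecutionId"])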

It's necessary to run the crawler prior to the job.

The crawler replaces Athena's MSCK REPAIR TABLE and also updates the table with new columns as they're added.
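
To orchestrate that ordering in code, you can start the crawler and poll its state before launching the job. A minimal boto3 sketch, assuming placeholder crawler and job names (in practice a Glue trigger or workflow can chain these for you):

    # Sketch: start a crawler, wait for it to finish, then run the ETL job.
    # "my-crawler" and "my-etl-job" are placeholder names.
    import time
    import boto3

    glue = boto3.client("glue")

    glue.start_crawler(Name="my-crawler")

    # Poll until the crawler returns to the READY state.
    while glue.get_crawler(Name="my-crawler")["Crawler"]["State"] != "READY":
        time.sleep(30)

    # The catalog is now up to date, so kick off the ETL job.
    glue.start_job_run(JobName="my-etl-job")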
