
Is it required to run AWS Glue crawler to detect new data before executing an ETL job?

The AWS Glue docs clearly state that crawlers scrape metadata from a source (JDBC or S3) and populate the Data Catalog (creating/updating databases and the corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know there are no schema/partitioning changes.

So, is it required to run a crawler prior to running an ETL job in order to pick up new data?

AWS Glue will automatically detect new data in S3 buckets as long as it lands within your existing folders (partitions).

If data is added to new folders (partitions), you need to load those partitions, for example by running MSCK REPAIR TABLE mytable; in Athena.
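As a sketch of that step, the MSCK REPAIR TABLE statement can be submitted through Athena's StartQueryExecution API with boto3. The database, table, and results-bucket names below are placeholders, and the actual boto3 call is shown commented out since it requires AWS credentials:

```python
def repair_table_params(database: str, table: str, output_s3: str) -> dict:
    """Build the arguments for Athena's StartQueryExecution API call that
    runs MSCK REPAIR TABLE, registering partitions added under new folders."""
    return {
        "QueryString": f"MSCK REPAIR TABLE {table};",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

# Hypothetical names; with boto3 and valid credentials you would then run:
#   import boto3
#   boto3.client("athena").start_query_execution(**repair_table_params(
#       "mydb", "mytable", "s3://my-query-results/"))
```

Note that MSCK REPAIR TABLE only discovers partition directories that follow the Hive `key=value` naming convention; otherwise partitions have to be added explicitly.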

It's necessary to run the crawler prior to the job.

The crawler replaces Athena's MSCK REPAIR TABLE and also updates the table with new columns as they're added.
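A minimal sketch of running the crawler before the job: start the crawler, poll until it returns to the READY state, then start the ETL job. The polling logic is factored into a plain helper; the crawler and job names in the commented boto3 usage are hypothetical:

```python
import time

def wait_until_ready(get_state, poll_seconds=10, timeout_seconds=600,
                     sleep=time.sleep):
    """Poll `get_state` (a zero-arg callable returning the crawler state,
    e.g. 'RUNNING', 'STOPPING', 'READY') until the crawler is READY.
    Returns True when READY, False if the timeout elapses first."""
    waited = 0
    while waited < timeout_seconds:
        if get_state() == "READY":
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False

# Hypothetical usage with boto3 (names are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   glue.start_crawler(Name="my-crawler")
#   if wait_until_ready(lambda: glue.get_crawler(Name="my-crawler")["Crawler"]["State"]):
#       glue.start_job_run(JobName="my-etl-job")
```

Injecting `sleep` as a parameter keeps the helper testable without real waiting; in production the defaults poll Glue every 10 seconds. Alternatively, a Glue workflow or trigger can chain the crawler and the job without hand-written polling.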
