We have a simple ETL setup below
This works fine most of the time, but in some cases our vendor uploads several files in quick succession, for example when refreshing historical data. This causes a problem because a Glue crawler cannot run concurrently, so the job fails.
I'm wondering whether there is anything we can do to avoid this error. I've looked into SQS, but I'm not sure whether it can help. Below is what I would like to achieve:
Is this doable with S3 and SQS?
SQS holds the event,
Yes, you can do this, as you can set up an SQS delay of up to 15 minutes.
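As a rough illustration of the delay setting, the sketch below creates a queue whose messages stay hidden for a fixed time after they arrive. It assumes boto3 and configured AWS credentials; the helper names and the queue name in the usage line are hypothetical, not something from the question.

```python
MAX_SQS_DELAY_SECONDS = 900  # SQS caps the queue-level delay at 15 minutes


def delay_attributes(delay_seconds):
    """Build the queue attributes for a delay queue, validating the SQS limit."""
    if not 0 <= delay_seconds <= MAX_SQS_DELAY_SECONDS:
        raise ValueError("SQS delay must be between 0 and 900 seconds")
    # SQS expects attribute values as strings.
    return {"DelaySeconds": str(delay_seconds)}


def create_delay_queue(queue_name, delay_seconds):
    """Create the queue and return its URL (queue_name is a hypothetical name)."""
    import boto3  # imported here so the helper above works without AWS access

    sqs = boto3.client("sqs")
    response = sqs.create_queue(
        QueueName=queue_name,
        Attributes=delay_attributes(delay_seconds),
    )
    return response["QueueUrl"]
```

For example, `create_delay_queue("vendor-upload-events", 300)` would hold each event for 5 minutes. Note that this delays every message by the same amount; it does not reset when a newer event arrives.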
wait until there has been no other following event for a given time period, say 5 minutes.
No, there is no automated way to do that. You have to develop your own custom solution. The simplest approach would be to not couple SQS with Lambda, and instead run the Lambda function on a schedule (e.g. every 5 minutes). The function would need logic to determine whether no new files have been uploaded for some time, and only then trigger your Glue job. This would probably involve DynamoDB to keep track of the last uploaded files between Lambda executions.
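A minimal sketch of such a scheduled function might look like the following. It assumes a separate S3-triggered function writes the latest upload time into a DynamoDB item and resets a `processed` flag. The table name, key, attribute names, and Glue job name are all hypothetical placeholders.

```python
import time

QUIET_SECONDS = 300  # assumed quiet period: 5 minutes without new uploads


def is_quiet(last_upload_epoch, now_epoch, quiet_seconds=QUIET_SECONDS):
    """Return True when no upload has been seen for at least quiet_seconds."""
    return now_epoch - last_upload_epoch >= quiet_seconds


def handler(event, context):
    """Entry point for a Lambda function invoked on a schedule."""
    import boto3  # imported here so is_quiet can be tested without AWS access

    # Hypothetical table holding one item that tracks the most recent upload.
    table = boto3.resource("dynamodb").Table("upload-tracker")
    item = table.get_item(Key={"pk": "latest"}).get("Item")

    if not item or item.get("processed"):
        return "nothing to do"
    if not is_quiet(int(item["last_upload_epoch"]), int(time.time())):
        return "still receiving files"

    # Hypothetical job name; use start_crawler instead to start a crawler.
    boto3.client("glue").start_job_run(JobName="my-etl-job")
    table.update_item(
        Key={"pk": "latest"},
        UpdateExpression="SET #p = :done",
        ExpressionAttributeNames={"#p": "processed"},
        ExpressionAttributeValues={":done": True},
    )
    return "job started"
```

An upload that lands between the read and the update could be missed in this sketch; a conditional update on `last_upload_epoch` would be one way to guard against that.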