
Batch processing from S3 with Airflow

I receive messages from a feed which sends about 2 million messages per day. Each message is small and does not need to be handled in a streaming fashion, so I batch process a full day of messages at a time.

The current infrastructure receives these messages, groups them together and uploads a gzipped file containing 1000 messages to AWS S3. The files are named as a datetime stamp of the format yyyymmdd-hhmmss. The batch process runs once a day (scheduled on Airflow) and it should select the new files from the bucket and process them. Currently, I am not using any hooks or sensors for this job.

My question is: what is the best way to collect the new files from S3 if the folder also contains the files from all previous days?

The inefficient solution I have is to pull down the list of files in the folder on S3 and process the files which have a filename matching the date I am processing for. My batch process is an Airflow DAG so I would like to maintain idempotency, meaning I don't want to remove files from this S3 folder after processing.

Ideally, I would like to select and process only the files whose filename datetime is after midnight of the previous day (relative to the execution date), without having to cycle through the full list of files. That list grows continuously, so each day's run is slightly slower than the last.

Is there a better mechanism provided either through Airflow or Python to select files from S3? Or is there a method of performing such a task in a more efficient way?

Not knowing your full infrastructure, is it possible for you to partition by date at the point of uploading the gzipped files to S3?

Including an S3 folder prefix of /date=yyyymmdd/ would enable you to retrieve just that day's files and preserve idempotency.
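As a rough sketch of what that could look like in Python: the helper names, the `messages` base prefix and the `.gz` extension below are assumptions for illustration, not something from your setup. The listing function expects a boto3 S3 client (for example `boto3.client("s3")`) and uses its `list_objects_v2` paginator with a `Prefix`, so S3 returns only the keys under that one day's prefix instead of the whole folder.

```python
from datetime import datetime


def partitioned_key(received_at: datetime, base_prefix: str = "messages") -> str:
    """Build a date-partitioned key, e.g. messages/date=20240115/20240115-093000.gz.

    base_prefix and the .gz suffix are hypothetical; adapt them to your bucket layout.
    """
    day = received_at.strftime("%Y%m%d")
    stamp = received_at.strftime("%Y%m%d-%H%M%S")
    return f"{base_prefix}/date={day}/{stamp}.gz"


def list_day_keys(s3_client, bucket: str, day: str, base_prefix: str = "messages") -> list:
    """Return the keys stored under a single day's prefix.

    s3_client is expected to be a boto3 S3 client. The paginator follows
    continuation tokens, so listings of more than 1000 keys are handled.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    prefix = f"{base_prefix}/date={day}/"
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from the response when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

Because nothing is deleted or moved, calling `list_day_keys` again for the same day returns the same set of files, which is what keeps a re-run of the DAG idempotent.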

When it comes to your Airflow job, you pass the date as an argument so that it retrieves only the files in that day's S3 partition.
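A minimal sketch of the date-passing part, assuming the task is a plain Python callable: `day_prefix` and the `messages` base prefix are hypothetical names. Airflow's `ds_nodash` template variable renders the run's date as yyyymmdd, which matches the partition naming.

```python
from datetime import datetime


def day_prefix(ds_nodash: str, base_prefix: str = "messages") -> str:
    """Map a yyyymmdd date string to that day's S3 prefix.

    base_prefix is a hypothetical name. strptime raises ValueError on a
    malformed date, so a bad argument fails fast instead of listing nothing.
    """
    datetime.strptime(ds_nodash, "%Y%m%d")
    return f"{base_prefix}/date={ds_nodash}/"


# In the DAG, the date could be injected through templating, for example:
#
#   PythonOperator(
#       task_id="process_day",
#       python_callable=day_prefix,
#       op_kwargs={"ds_nodash": "{{ ds_nodash }}"},
#   )
```

The same run date always maps to the same prefix, so backfills and retries read exactly the files they read the first time.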
