
Move data from Postgres/MySQL to S3 using Airflow

We are trying to move from Pentaho Kettle to Apache Airflow for ETL and to centralize all data processes under one tool.

We use Kettle to read data daily from Postgres/MySQL databases and move it to S3 -> Redshift.

What is the easiest way to do this? I do not see an operator that can do this directly, so should I use the MySQL/Postgres operator to put the data in a local file, and then use the S3 operator to move it to S3?

Thank you

You can build your own operator, 'mysql_to_s3', and add it as a plugin to Airflow.

There is an operator that archives data from MySQL to GCS:

mysql_to_gcs.py

You can keep almost all of that code and make a small change in def _upload_to_gcs, using an S3 hook instead: s3_hook.py.
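
For reference, here is a minimal sketch of what that adapted operator could look like on Airflow 1.x, with the GCS upload swapped for S3Hook. The class name, connection ids, and the CSV serialization are my own illustrative choices, not code taken from mysql_to_gcs.py:

```python
# Hypothetical MySQL -> S3 operator, modelled on mysql_to_gcs.py but with the
# upload step replaced by S3Hook. Names and defaults here are assumptions.
import csv
import io

from airflow.models import BaseOperator
from airflow.hooks.mysql_hook import MySqlHook
from airflow.hooks.S3_hook import S3Hook
from airflow.utils.decorators import apply_defaults


class MySQLToS3Operator(BaseOperator):
    """Run a MySQL query and upload the result set to S3 as a CSV object."""

    template_fields = ('query', 's3_key')

    @apply_defaults
    def __init__(self, query, s3_bucket, s3_key,
                 mysql_conn_id='mysql_default',
                 aws_conn_id='aws_default',
                 *args, **kwargs):
        super(MySQLToS3Operator, self).__init__(*args, **kwargs)
        self.query = query
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.mysql_conn_id = mysql_conn_id
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        mysql = MySqlHook(mysql_conn_id=self.mysql_conn_id)
        s3 = S3Hook(aws_conn_id=self.aws_conn_id)

        # Pull the whole result set into memory; fine for daily batch extracts
        # of modest size, otherwise stream to a temp file and use load_file().
        records = mysql.get_records(self.query)

        buf = io.StringIO()
        csv.writer(buf).writerows(records)

        s3.load_string(
            string_data=buf.getvalue(),
            key=self.s3_key,
            bucket_name=self.s3_bucket,
            replace=True,
        )
```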

Documentation about custom plugins:

Airflow plugins: Blog article

Airflow plugins: Official documentation
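
Following those docs, registering the custom operator only takes a small module in the plugins folder. The file name, module layout, and plugin name below are assumptions:

```python
# plugins/mysql_to_s3_plugin.py -- hypothetical registration of the operator
# sketched above via Airflow's plugin mechanism.
from airflow.plugins_manager import AirflowPlugin

from operators.mysql_to_s3 import MySQLToS3Operator  # assumed module layout


class MySQLToS3Plugin(AirflowPlugin):
    name = 'mysql_to_s3_plugin'
    operators = [MySQLToS3Operator]
```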

The airflow-plugins repo (by Astronomer) has a MySqlToS3Operator that takes the result set of a MySQL query and places it on S3 as either CSV or JSON.

The plugin can be found here: https://github.com/airflow-plugins/mysql_plugin/blob/master/operators/mysql_to_s3_operator.py

From there you can use the s3_to_redshift operator to load the data from S3 into Redshift: https://airflow.readthedocs.io/en/latest/_modules/airflow/operators/s3_to_redshift_operator.html
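
Put together, a daily DAG could chain the two steps: extract from MySQL to S3, then COPY into Redshift. The sketch below uses the custom operator sketched in the first answer rather than the Astronomer one (whose exact arguments I have not reproduced here); all DAG ids, connection ids, bucket, schema, and table names are placeholders. Note that the linked 1.10 operator builds the COPY source path from s3_key plus the table name, so verify the key layout against your Airflow version:

```python
# Hypothetical daily DAG: MySQL -> S3 extract, then S3 -> Redshift COPY.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer

# Custom operator sketched above (assumed importable from the plugins folder).
from operators.mysql_to_s3 import MySQLToS3Operator

default_args = {
    'owner': 'etl',
    'start_date': datetime(2020, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('mysql_to_redshift_daily',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    extract = MySQLToS3Operator(
        task_id='mysql_to_s3',
        query='SELECT * FROM sales',     # placeholder query
        s3_bucket='my-etl-bucket',       # placeholder bucket
        s3_key='exports/sales/sales',    # object key: <prefix>/<table>
        mysql_conn_id='mysql_default',
        aws_conn_id='aws_default',
    )

    load = S3ToRedshiftTransfer(
        task_id='s3_to_redshift',
        schema='public',
        table='sales',
        s3_bucket='my-etl-bucket',
        s3_key='exports/sales',          # the operator appends /<table> when building the COPY path
        redshift_conn_id='redshift_default',
        aws_conn_id='aws_default',
        copy_options=['CSV'],
    )

    extract >> load
```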
