
How to export large data from Postgres to S3 using Cloud Composer?

I have been using the Postgres to S3 operator to load data from Postgres to S3. Recently, though, I had to export a very large table, and my Airflow task on Composer failed without emitting any logs. This is likely because the operator uses the NamedTemporaryFile function of Python's tempfile module to create a temporary file, which is then uploaded to S3. Since we are using Composer, that temporary file is held locally on the Composer worker, and because the file is very large, the task fails.

Refer here: https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#task_fails_without_emitting_logs
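For reference, my understanding of the relevant part of the operator is roughly the pattern below (a simplified sketch from memory, not the exact source): everything the cursor returns is collected first, written to one NamedTemporaryFile, and only then uploaded.

from tempfile import NamedTemporaryFile

from airflow.hooks.postgres_hook import PostgresHook
from airflow.hooks.S3_hook import S3Hook

def export_all_at_once(postgres_conn_id, query, s3_conn_id, bucket, key):
    # Simplified sketch of the current behaviour (not the exact operator source).
    pg_hook = PostgresHook(postgres_conn_id=postgres_conn_id)
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    cursor = pg_hook.get_conn().cursor()
    cursor.execute(query)
    rows = cursor.fetchall()  # the whole result set is materialised in memory here

    with NamedTemporaryFile(mode="w", suffix=".csv") as tmp:
        for row in rows:      # ...and then written to one big temp file on the worker
            tmp.write(",".join(str(col) for col in row) + "\n")
        tmp.flush()
        s3_hook.load_file(filename=tmp.name, key=key, bucket_name=bucket, replace=True)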

I did check the RedshiftToS3 operator, since it also uses a Postgres hook and offers several UNLOAD options that handle large exports easily, but I realised there is no 1:1 correspondence between Redshift and Postgres, so that is not an option. Is there any way I can split my Postgres query? Right now I'm doing SELECT * FROM TABLENAME, and I have no prior information about the table.

I also came across this similar operator: https://airflow.apache.org/docs/stable/_modules/airflow/contrib/operators/sql_to_gcs.html

Here there is a param approx_max_file_size_bytes:

This operator supports the ability to split large table dumps into multiple files (see notes in the filename param docs above). This param allows developers to specify the file size of the splits.

What I understood from the code is that it creates a new temporary file whenever the size exceeds the given limit. So is it splitting the dump into multiple temp files and uploading them separately? Roughly the pattern sketched below, in other words.
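This is my reading of that code (a simplified sketch with names I made up, not the actual sql_to_gcs implementation):

from tempfile import NamedTemporaryFile

def split_rows_into_files(cursor, approx_max_file_size_bytes):
    # Rough sketch of the file-rotation idea as I understand it (names are mine).
    tmp_file = NamedTemporaryFile(mode="w", suffix=".csv")
    for row in cursor:
        tmp_file.write(",".join(str(col) for col in row) + "\n")
        # tell() reports how much has been written to the current file so far
        if tmp_file.tell() >= approx_max_file_size_bytes:
            tmp_file.flush()
            yield tmp_file  # hand the full chunk to the caller for upload
            tmp_file = NamedTemporaryFile(mode="w", suffix=".csv")  # start a fresh chunk
    tmp_file.flush()
    yield tmp_file          # the last, possibly partial, chunk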

EDIT: Let me explain again exactly what I'm trying to do. Currently, the Postgres to S3 operator creates a single temp file and writes all of the results returned by the cursor to it, and that is what causes the memory issue. What I'm thinking is: add a max_file_size limit, write each row from the cursor to the temp file, and whenever the temp file exceeds that max_file_size limit, upload its contents to S3, then flush or delete it, create a new temp file, write the next rows of the cursor to that file, and upload it as well. I'm just not sure how to modify the operator to do that.

As you've figured out already, it's because you are building up a dictionary with every row in the table; when the table has many rows, you run out of memory on the machine.

You've already answered your own question, really: only write rows until the file reaches a certain size, then push the file to S3. Alternatively, you could keep the file on disk and flush the dictionary object every x rows, but in that case the file could grow very large on disk rather than in memory.
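Something along these lines should work, assuming the usual PostgresHook / S3Hook; max_file_size, the key naming and the function itself are just placeholders you would fold into the operator's execute method (a rough sketch, not a drop-in replacement):

from tempfile import NamedTemporaryFile

from airflow.hooks.postgres_hook import PostgresHook
from airflow.hooks.S3_hook import S3Hook

def export_in_chunks(postgres_conn_id, query, s3_conn_id, bucket, key_prefix,
                     max_file_size=100 * 1024 * 1024):
    pg_hook = PostgresHook(postgres_conn_id=postgres_conn_id)
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    conn = pg_hook.get_conn()
    # A named (server-side) cursor so psycopg2 streams rows in batches
    # instead of pulling the whole table into memory at once.
    cursor = conn.cursor(name="s3_export_cursor")
    cursor.itersize = 10000
    cursor.execute(query)

    chunk_no = 0
    tmp = NamedTemporaryFile(mode="w", suffix=".csv")
    for row in cursor:
        tmp.write(",".join(str(col) for col in row) + "\n")
        if tmp.tell() >= max_file_size:
            # this chunk is big enough: flush it, ship it, start a new one
            tmp.flush()
            s3_hook.load_file(filename=tmp.name,
                              key="{}/part_{:05d}.csv".format(key_prefix, chunk_no),
                              bucket_name=bucket, replace=True)
            tmp.close()  # closing a NamedTemporaryFile deletes it
            tmp = NamedTemporaryFile(mode="w", suffix=".csv")
            chunk_no += 1

    # upload whatever is left in the final (possibly partial) chunk
    tmp.flush()
    s3_hook.load_file(filename=tmp.name,
                      key="{}/part_{:05d}.csv".format(key_prefix, chunk_no),
                      bucket_name=bucket, replace=True)
    tmp.close()

The named cursor is what keeps memory flat on the Composer worker; the file rotation keeps local disk usage bounded to roughly max_file_size per chunk.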
