
How to export large data from Postgres to S3 using Cloud Composer?

I have been using the Postgres to S3 operator to load data from Postgres to S3. Recently, though, I had to export a very large table, and my Airflow task on Composer failed without emitting any logs. This is likely because the operator uses the NamedTemporaryFile function of Python's tempfile module to create a temporary file, which is then uploaded to S3. Since we are using Composer, that temporary file is held locally on the Composer worker, and because the file is very large, the task fails.

Refer here: https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#task_fails_without_emitting_logs
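For reference, my understanding of the relevant part of the operator is roughly the pattern below (a simplified sketch from memory, not the exact source): everything the cursor returns is collected first, written to one NamedTemporaryFile, and only then uploaded.

from tempfile import NamedTemporaryFile

from airflow.hooks.postgres_hook import PostgresHook
from airflow.hooks.S3_hook import S3Hook

def export_all_at_once(postgres_conn_id, query, s3_conn_id, bucket, key):
    # Simplified sketch of the current behaviour (not the exact operator source).
    pg_hook = PostgresHook(postgres_conn_id=postgres_conn_id)
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    cursor = pg_hook.get_conn().cursor()
    cursor.execute(query)
    rows = cursor.fetchall()  # the whole result set is materialised in memory here

    with NamedTemporaryFile(mode="w", suffix=".csv") as tmp:
        for row in rows:      # ...and then written to one big temp file on the worker
            tmp.write(",".join(str(col) for col in row) + "\n")
        tmp.flush()
        s3_hook.load_file(filename=tmp.name, key=key, bucket_name=bucket, replace=True)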

I did check the RedshiftToS3 operator, since it also uses a Postgres hook and offers several UNLOAD options that handle large exports easily, but I realised there is no 1:1 correspondence between Redshift and Postgres, so that is not an option. Is there any way I can split my Postgres query? Right now I'm doing SELECT * FROM TABLENAME, and I have no prior information about the table.

I also came across this similar operator: https://airflow.apache.org/docs/stable/_modules/airflow/contrib/operators/sql_to_gcs.html

Here there is a param approx_max_file_size_bytes:

This operator supports the ability to split large table dumps into multiple files (see notes in the filename param docs above). This param allows developers to specify the file size of the splits.

What I understood from the code is that it creates a new temporary file whenever the size exceeds the given limit. So is it splitting the dump into multiple temp files and uploading them separately? Roughly the pattern sketched below, in other words.
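This is my reading of that code (a simplified sketch with names I made up, not the actual sql_to_gcs implementation):

from tempfile import NamedTemporaryFile

def split_rows_into_files(cursor, approx_max_file_size_bytes):
    # Rough sketch of the file-rotation idea as I understand it (names are mine).
    tmp_file = NamedTemporaryFile(mode="w", suffix=".csv")
    for row in cursor:
        tmp_file.write(",".join(str(col) for col in row) + "\n")
        # tell() reports how much has been written to the current file so far
        if tmp_file.tell() >= approx_max_file_size_bytes:
            tmp_file.flush()
            yield tmp_file  # hand the full chunk to the caller for upload
            tmp_file = NamedTemporaryFile(mode="w", suffix=".csv")  # start a fresh chunk
    tmp_file.flush()
    yield tmp_file          # the last, possibly partial, chunk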

EDIT: Let me explain again exactly what I'm trying to do. Currently, the Postgres to S3 operator creates a single temp file and writes all of the results returned by the cursor to it, and that is what causes the memory issue. What I'm thinking is: add a max_file_size limit, write each row from the cursor to the temp file, and whenever the temp file exceeds that max_file_size limit, upload its contents to S3, then flush or delete it, create a new temp file, write the next rows of the cursor to that file, and upload it as well. I'm just not sure how to modify the operator to do that.

As you've figured out already, it's because you are building up a dictionary with every row in the table; when the table has many rows, you run out of memory on the machine.

You've already answered your own question, really: only write rows until the file reaches a certain size, then push the file to S3. Alternatively, you could keep the file on disk and flush the dictionary object every x rows, but in that case the file could grow very large on disk rather than in memory.
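Something along these lines should work, assuming the usual PostgresHook / S3Hook; max_file_size, the key naming and the function itself are just placeholders you would fold into the operator's execute method (a rough sketch, not a drop-in replacement):

from tempfile import NamedTemporaryFile

from airflow.hooks.postgres_hook import PostgresHook
from airflow.hooks.S3_hook import S3Hook

def export_in_chunks(postgres_conn_id, query, s3_conn_id, bucket, key_prefix,
                     max_file_size=100 * 1024 * 1024):
    pg_hook = PostgresHook(postgres_conn_id=postgres_conn_id)
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    conn = pg_hook.get_conn()
    # A named (server-side) cursor so psycopg2 streams rows in batches
    # instead of pulling the whole table into memory at once.
    cursor = conn.cursor(name="s3_export_cursor")
    cursor.itersize = 10000
    cursor.execute(query)

    chunk_no = 0
    tmp = NamedTemporaryFile(mode="w", suffix=".csv")
    for row in cursor:
        tmp.write(",".join(str(col) for col in row) + "\n")
        if tmp.tell() >= max_file_size:
            # this chunk is big enough: flush it, ship it, start a new one
            tmp.flush()
            s3_hook.load_file(filename=tmp.name,
                              key="{}/part_{:05d}.csv".format(key_prefix, chunk_no),
                              bucket_name=bucket, replace=True)
            tmp.close()  # closing a NamedTemporaryFile deletes it
            tmp = NamedTemporaryFile(mode="w", suffix=".csv")
            chunk_no += 1

    # upload whatever is left in the final (possibly partial) chunk
    tmp.flush()
    s3_hook.load_file(filename=tmp.name,
                      key="{}/part_{:05d}.csv".format(key_prefix, chunk_no),
                      bucket_name=bucket, replace=True)
    tmp.close()

The named cursor is what keeps memory flat on the Composer worker; the file rotation keeps local disk usage bounded to roughly max_file_size per chunk.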
