How to overcome the 2hr connection timeout (OperationalError) using SQLAlchemy and Postgres?

Question

I'm trying to execute some long-running SQL queries using SQLAlchemy against a Postgres database hosted on AWS RDS.

from sqlalchemy import create_engine, text

conn_str = 'postgresql://user:password@db-primary.cluster-cxf.us-west-2.rds.amazonaws.com:5432/dev'
engine = create_engine(conn_str)

# This takes about 4 hours to execute when run in pgAdmin.
# text() wraps the raw SQL string; plain strings are deprecated in SQLAlchemy 1.4.
sql = text('UPDATE "Clients" SET "Name" = NULL')
with engine.begin() as conn:
    conn.execute(sql)

After running for exactly 2 hours, the script errors out with

OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

(Background on this error at: https://sqlalche.me/e/14/e3q8)

I have tested setting connection timeouts in SQLAlchemy (based on "How to set connection timeout in SQLAlchemy"). This did not make a difference.

I have also checked the timeout settings on the Postgres server (based on https://dba.stackexchange.com/questions/164419/is-it-possible-to-limit-timeout-on-postgres-server), but both statement_timeout and idle_in_transaction_session_timeout are set to 0, meaning no limits are set.

Answer 1

I agree with @jjanes. This smells like a TCP connection timeout issue. Something in the network layer, be it a NAT or a firewall, may have silently dropped your TCP connection, leaving the code to wait for the full TCP keepalive timeout until it sees the connection as closed. The Linux default for that timeout (tcp_keepalive_time) is 7200 seconds, which would fit the error showing up after exactly 2 hours. This usually happens when the network topology between the client and the database is complicated, for example when there is a company firewall or some sort of interconnection in between. pgAdmin may come with preconfigured TCP keepalive settings, which would explain why it was not affected, but I'm not sure.

Other timeouts didn't kick in because, in my understanding, the TCP timeout lives at layer 4 (transport), below the layer 7 (application) timeouts you checked, so it takes effect regardless of how those are set.

You could try adding keepalive parameters to your connection string to see whether that resolves the issue. For example:

postgresql://user:password@db-primary.cluster-cxf.us-west-2.rds.amazonaws.com:5432/dev?keepalives_idle=1&keepalives_count=1&tcp_user_timeout=1000

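Note the keepalive parameters at the end. For reference, the server-side equivalents of these settings are explained at https://www.postgresql.org/docs/current/runtime-config-connection.html and the client-side (libpq) connection parameters used in the URL above are listed at https://www.postgresql.org/docs/current/libpq-connect.html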
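If you would rather not hand-edit the URL, here is a minimal sketch of a helper that merges keepalive parameters into an existing connection string. The helper name with_keepalives and the values in KEEPALIVE_DEFAULTS are my own assumptions for illustration (they are less aggressive than the ones in the URL above), so tune them for your network. The parameter names themselves (keepalives, keepalives_idle, keepalives_interval, keepalives_count) are standard libpq connection parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative values only (assumptions, not taken from the answer above):
# send the first probe after 30 s of idle, retry every 10 s,
# and treat the connection as dead after 5 unanswered probes.
KEEPALIVE_DEFAULTS = {
    "keepalives": 1,
    "keepalives_idle": 30,
    "keepalives_interval": 10,
    "keepalives_count": 5,
}


def with_keepalives(conn_str, **overrides):
    """Return conn_str with libpq keepalive parameters merged into its query string.

    Hypothetical helper: existing query parameters are kept, and any keyword
    argument overrides the matching default.
    """
    parts = urlsplit(conn_str)
    query = dict(parse_qsl(parts.query))
    settings = {**KEEPALIVE_DEFAULTS, **overrides}
    query.update({name: str(value) for name, value in settings.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The resulting string can be passed to create_engine() as usual, e.g. create_engine(with_keepalives(conn_str)). With the psycopg2 driver, SQLAlchemy forwards these query parameters to the driver, which hands them on to libpq.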
