
How can I stop an active query on a Postgres coordinator when the worker node has crashed?

I have a Postgres SELECT query that cannot be stopped using the standard pg_cancel_backend/pg_terminate_backend functions. Both return true but do nothing. The query holds ACCESS SHARE locks on hundreds of tables, making it impossible for our ETLs to create new partitions. The query is listed as active, but we believe it is simply waiting for a response from the worker node that will never be sent.

The database cannot be brought down due to the nature of data ingestion. We do not have the ability to use TCP spoofing to send a signal to the coordinator, as that has been blocked at the OS level.

System details: Linux: Red Hat Enterprise Linux Server 7.4 (Maipo); PostgreSQL: 10.6; Citus: 8.1.1

pg_stat_activity for the query in question: (screenshot not included)

Our team has done the following:

  • select pg_cancel_backend(30334);
  • select pg_terminate_backend(30334);

At the Linux command prompt: kill 30334

Both queries return TRUE, the kill command does nothing, and the session persists. Even if we were to attempt to stop the Postgres database, we are afraid the system might wait for the query to finish.

Looking for suggestions that don't involve kill -9. Has anyone come across this problem before?

We see this happening on systems where TCP keepalive is misconfigured. The technical details are complicated, but when you run into these symptoms, Citus is waiting for a response on the socket to the worker. Because TCP keepalive is misconfigured, the Linux kernel keeps the socket open even though the remote end has already disappeared. With keepalive working, the kernel sends probe packets over TCP sockets that have no other traffic; when the probes go unanswered, the kernel detects the broken socket and closes it, which signals the Citus extension that is waiting for a response.
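To get a feel for why a dead worker can go unnoticed for so long, here is a rough sketch of how long the kernel takes to give up on an idle connection whose peer has vanished. It assumes the usual Linux model (one idle period, then a fixed number of unanswered probes at a fixed interval); the function name is made up for illustration, and the result is an approximation rather than an exact guarantee.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate seconds until the kernel drops an idle connection to a dead peer.

    idle     -> net.ipv4.tcp_keepalive_time   (seconds of silence before the first probe)
    interval -> net.ipv4.tcp_keepalive_intvl  (seconds between probes)
    probes   -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    return idle + interval * probes


# Stock Linux defaults: 7200 s idle, 75 s interval, 9 probes.
print(keepalive_detection_seconds(7200, 75, 9))

# A more aggressive setting, chosen here purely as an example.
print(keepalive_detection_seconds(60, 10, 6))
```

Note that these kernel settings only affect sockets that have keepalive switched on (SO_KEEPALIVE); a socket without it is never probed, no matter what the sysctl values are.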

To answer your question: if you cannot restart the coordinator, you will need to manually attach GDB to the backend that is in this state. From the backtrace you can poke around and find the file descriptor that the process is blocked on; you might want to install the debug symbols for both PostgreSQL and Citus. Calling close (https://linux.die.net/man/2/close) on this file descriptor will help unblock the backend. Alternatively, the FD can sometimes be found with strace.
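As a complement to GDB and strace, the candidate file descriptors can be narrowed down by listing which of the backend's open descriptors are sockets. The sketch below reads the Linux /proc filesystem; it is not part of the procedure described above, the helper name is hypothetical, and it only shows which descriptors are sockets, not which one the backend is blocked on. It needs to run as the same OS user as the backend or as root.

```python
import os
from typing import Dict


def list_socket_fds(pid: int) -> Dict[int, str]:
    """Return {fd number: link target} for every socket the process has open."""
    fd_dir = f"/proc/{pid}/fd"
    sockets = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink; sockets look like "socket:[123456]".
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:"):
            sockets[int(name)] = target
    return sockets


if __name__ == "__main__":
    # Demonstrated on the current process; for the stuck backend you would
    # pass its PID instead (30334 in the question).
    for fd, target in sorted(list_socket_fds(os.getpid()).items()):
        print(fd, target)
```

The number inside socket:[...] is the socket's inode, which can be matched against the inode column of /proc/net/tcp to see the remote address and tell the worker connection apart from the client connection.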

Lastly, you will want to enable and configure TCP keepalive so that the kernel actually sends keepalive messages and closes the socket for you when the other end disappears. Pointers can be found here: https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
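For illustration, this is what enabling keepalive looks like on a single socket on Linux. It is a minimal sketch of the per-socket options that mirror the kernel-wide settings, not a description of how Citus or PostgreSQL open their connections; the function name and the timing values are assumptions picked for the example.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, probes: int = 6) -> None:
    """Turn on TCP keepalive for one socket and override the kernel defaults (Linux)."""
    # Without SO_KEEPALIVE the kernel never probes this socket at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of silence before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between successive probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the connection is declared dead.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

Once a socket is configured this way, a blocking read on a connection to a vanished peer fails with an error after the probes run out instead of waiting indefinitely.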
