
python subprocess and mysqldump

I know parts of this question have been asked before, but I have some related questions.

I'm trying to execute

mysqldump -u uname -ppassword --add-drop-database --databases databaseName | gzip > fileName

I'm potentially dumping a very large (200GB?) db. Is that in itself a dumb thing to do? I then want to send the zipped file over the network for storage, delete the local dump, and purge a couple of tables.

Anyway, I was using subprocess like this, because there doesn't seem to be a way to execute the entire original call without subprocess treating | as a table name:

from subprocess import Popen, PIPE

f = open(FILENAME, 'wb')
args = ['mysqldump', '-u', 'UNAME', '-pPASSWORD', '--add-drop-database', '--databases', 'DB']

p1 = Popen(args, stdout=PIPE)
p2 = Popen('gzip', stdin=p1.stdout, stdout=f)
p2.communicate()

but then I read that communicate caches the data in memory, which wouldn't work for me. Is this true?

What I ended up doing for now is:

import gzip
import subprocess

f = open(filename, 'wb')
subprocess.call(args, stdout=f)   # write the uncompressed dump to disk first
f.close()

f = open(filename, 'rb')
zipFilename = filename + '.gz'
f2 = gzip.open(zipFilename, 'wb')
f2.writelines(f)                  # then re-read it and compress it
f2.close()
f.close()

Of course this takes a million years, and I hate it.

My questions:

1. Can I use my first approach on a very large db?
2. Could I possibly pipe the output of mysqldump to a socket and fire it across the network and save it when it arrives, rather than sending a zipped file?

Thanks!

You don't need communicate(). It's only there as a convenience method if you want to read stdout/stderr to completion. But since you are chaining the commands, they are doing that for you. Just wait for them to complete.

from subprocess import Popen, PIPE

args = ['mysqldump', '-u', 'UNAME', '-pPASSWORD', '--add-drop-database', '--databases', 'DB']

with open(FILENAME, 'wb', 0) as f:
    p1 = Popen(args, stdout=PIPE)
    p2 = Popen('gzip', stdin=p1.stdout, stdout=f)
p1.stdout.close() # force write error (/SIGPIPE) if p2 dies
p2.wait()
p1.wait()

Your example code using two subprocess.Popen calls is correct (albeit slightly improvable), and this:

... I read that communicate caches the data in memory

is also correct—it reads into memory all of the standard output and standard error output that the "communicating command" produces on a subprocess.PIPE—but that is not a problem here, because you have this:

p1 = Popen(args, stdout=PIPE)
p2 = Popen('gzip', stdin=p1.stdout, stdout=f)
p2.communicate()

You're calling communicate() on p2, whose stdout output is sent to f (an opened file), and whose stderr output—which is probably empty anyway (no errors occur)—is not being sent to a PIPE. Thus, p2.communicate() would at worst have to read and buffer up a grand total of zero bytes of stdout plus zero bytes of stderr. It's actually a bit more clever: noticing that there is no PIPE, it returns the tuple (None, None).
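
A quick way to see this for yourself (a minimal sketch of my own, not from the answer, assuming a POSIX echo is available):

import os
from subprocess import Popen

# stdout goes to a real file, not a PIPE, so communicate() has nothing to buffer
with open(os.devnull, 'wb') as devnull:
    p = Popen(['echo', 'hello'], stdout=devnull)
    out, err = p.communicate()
print(out, err)   # prints: None None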

If you were to call p1.communicate(), that would be more of an issue (although in this case you'd then be fighting with p2, the gzip process, for the output from p1, which would be even worse). But you are not; p1's output flows to p2, and p2's output flows to a file.

Since none of p2's output is sent to a PIPE, there is no need to call p2.communicate() here: you can simply call p2.wait(). That makes it clearer that there's no data flowing back from p2 (which I would say is a minor improvement to the code, although if you decide you want to capture p2's stderr after all, you'd have to change that back).


Edit to add: as in glglgl's answer, it's important to close p1's pipe to p2 after creating p2, otherwise p2 will also wait for your Python process to send it data.
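
Putting those two points together, a minimal sketch of the adjusted version of your snippet might look like this (reusing the args and FILENAME placeholders from the question):

from subprocess import Popen, PIPE

f = open(FILENAME, 'wb')
p1 = Popen(args, stdout=PIPE)
p2 = Popen('gzip', stdin=p1.stdout, stdout=f)
p1.stdout.close()   # drop the parent's copy so p1 gets SIGPIPE if p2 dies
p2.wait()           # nothing flows back through a PIPE, so wait() is enough
p1.wait()
f.close()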

You are quite close to where you want to be:

from subprocess import Popen, PIPE

f = open(FILENAME, 'wb')
args = ['mysqldump', '-u', 'UNAME', '-pPASSWORD', '--add-drop-database', '--databases', 'DB']

p1 = Popen(args, stdout=PIPE)

Up to here it is right.

p2 = Popen('gzip', stdin=p1.stdout, stdout=PIPE)

This one takes p1's output and processes it. Afterwards we can (and should) immediately call p1.stdout.close().

Now we have p2.stdout, which can be read from and, without using a temporary file, sent over the network:

import socket

s = socket.create_connection(('remote_pc', port))
while True:
    r = p2.stdout.read(65536)   # read the compressed stream in chunks
    if not r: break
    s.sendall(r)                # sendall() keeps sending until the whole chunk is out
s.close()
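
To "save it when it arrives", something has to be listening on the other machine. Here is a minimal receiving-side sketch (my own illustration, not part of the answer; the port and the 'dump.sql.gz' output name are placeholders):

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('', port))                 # same placeholder port as the sender
srv.listen(1)
conn, addr = srv.accept()
with open('dump.sql.gz', 'wb') as out:
    while True:
        chunk = conn.recv(65536)     # read whatever the sender pushed
        if not chunk:
            break
        out.write(chunk)             # write the compressed stream straight to disk
conn.close()
srv.close()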

Yup, the data is buffered in memory:

"Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited." - subprocess docs

Unfortunately, at the moment there is no way to use Popen asynchronously: see PEP 3145.

Rather than doing this all in Python, you can do it manually:

os.system("mysqldump -u uname -ppassword --add-drop-database --databases databaseName | gzip > fileName

")

with the appropriate string replacements using str.format, of course; otherwise you're putting an unnecessary amount of stress on your computer, especially trying to communicate 200 GB via a pipe...
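
For example, a rough sketch of that substitution (the user, password, database and output names below are the same placeholders used elsewhere on this page, not real settings):

import os

cmd = ("mysqldump -u {user} -p{password} --add-drop-database "
       "--databases {db} | gzip > {outfile}")
os.system(cmd.format(user='uname', password='password',
                     db='databaseName', outfile='fileName'))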

Can you elaborate on what you are trying to do? Right now it sounds like you're both dumping and zipping on the same computer.


Yes, you can stream a file across the network... I don't know if you want to stream the output of mysqldump directly, though - you might want to look at your network capabilities before considering that.


bash:

#!/bin/bash
mysqldump -u uname -ppassword --add-drop-database --databases databaseName | gzip > fileName
 #transfer fileName to other computer

^ you can also put this in a crontab and have it run at intervals :)
