
Using subprocess.Popen for Process with Large Output

I have some Python code that executes an external app, which works fine when the app has a small amount of output but hangs when there is a lot. My code looks like:

p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
errcode = p.wait()
retval = p.stdout.read()
errmess = p.stderr.read()
if errcode:
    log.error('cmd failed <%s>: %s' % (errcode, errmess))

There are comments in the docs that seem to indicate the potential issue. Under wait, there is:

Warning: This will deadlock if the child process generates enough output to a stdout or stderr pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.

though under communicate, I see:

Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

So it is unclear to me whether I should use either of these if I have a large amount of data. They don't indicate what method I should use in that case.

I do need the return value from the exec, and I do parse and use both stdout and stderr.

So what is an equivalent method in Python to exec an external app that is going to have large output?

You're doing blocking reads on two files; the first needs to complete before the second starts. If the application writes a lot to stderr, and nothing to stdout, then your process will sit waiting for data on stdout that isn't coming, while the program you're running sits there waiting for the stuff it wrote to stderr to be read (which it never will be, since you're waiting for stdout).

There are a few ways you can fix this.

The simplest is to not intercept stderr; leave stderr=None. Errors will be output to stderr directly. You can't intercept them and display them as part of your own message. For command-line tools, this is often OK. For other apps, it can be a problem.
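A minimal sketch of that first option, using a throwaway `sys.executable -c` child as a stand-in for the real command:

```python
import subprocess
import sys

# stderr is left at its default (None): the child's error output goes
# straight to the parent's stderr instead of into a pipe that can fill up.
p = subprocess.Popen([sys.executable, '-c', 'print("hello")'],
                     stdout=subprocess.PIPE)
out, _ = p.communicate()  # only stdout is captured
```

Because only one pipe exists, there is no second stream for the child to block on.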

Another simple approach is to redirect stderr to stdout, so you only have one incoming file: set stderr=STDOUT. This means you can't distinguish regular output from error output. This may or may not be acceptable, depending on how the application writes output.
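A sketch of the merge approach; the child here is a stand-in that writes to both streams so the interleaving is visible:

```python
import subprocess
import sys

# Merge stderr into stdout, so there is only one pipe to drain.
code = 'import sys; sys.stdout.write("out\\n"); sys.stderr.write("err\\n")'
p = subprocess.Popen([sys.executable, '-c', code],
                     stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
merged, _ = p.communicate()  # error lines arrive interleaved with stdout
```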

The complete and complicated way of handling this is select ( http://docs.python.org/library/select.html ). This lets you read in a non-blocking way: you get data whenever data appears on either stdout or stderr. I'd only recommend this if it's really necessary. This probably doesn't work on Windows.

Reading stdout and stderr independently with very large output (i.e., lots of megabytes) using select:

import subprocess, select

proc = subprocess.Popen(cmd, bufsize=8192, shell=False,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

with open(outpath, "wb") as outf:
    dataend = False
    while (proc.returncode is None) or (not dataend):
        proc.poll()
        dataend = False

        ready = select.select([proc.stdout, proc.stderr], [], [], 1.0)

        if proc.stderr in ready[0]:
            data = proc.stderr.read(1024)
            if len(data) > 0:
                handle_stderr_data(data)  # your own handler for error output

        if proc.stdout in ready[0]:
            data = proc.stdout.read(1024)
            if len(data) == 0: # Read of zero bytes means EOF
                dataend = True
            else:
                outf.write(data)

"A lot of output" is subjective, so it's a little difficult to make a recommendation. If the amount of output is really large, then you likely don't want to grab it all with a single read() call anyway. You may want to try writing the output to a file and then pulling the data in incrementally, like so:

f = open('data.out', 'w')
p = subprocess.Popen(cmd, shell=True, stdout=f, stderr=subprocess.PIPE)
errcode = p.wait()
f.close()
if errcode:
    errmess = p.stderr.read()
    log.error('cmd failed <%s>: %s' % (errcode, errmess))
for line in open('data.out'):
    pass  # do something with each line

Glenn Maynard is right in his comment about deadlocks. However, the best way of solving this problem is to create two threads, one for stdout and one for stderr, which read those respective streams until exhausted and do whatever you need with the output.
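A minimal sketch of that two-thread approach; the child command is a stand-in that floods both streams, and `drain` is a hypothetical helper name:

```python
import subprocess
import sys
import threading

def drain(stream, chunks):
    """Read a pipe to EOF from a worker thread, collecting the data."""
    for piece in iter(lambda: stream.read(8192), b''):
        chunks.append(piece)
    stream.close()

# Stand-in child that writes 100 kB to each of stdout and stderr.
code = ('import sys;'
        'sys.stdout.write("o" * 100000);'
        'sys.stderr.write("e" * 100000)')
p = subprocess.Popen([sys.executable, '-c', code],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)

out_chunks, err_chunks = [], []
t_out = threading.Thread(target=drain, args=(p.stdout, out_chunks))
t_err = threading.Thread(target=drain, args=(p.stderr, err_chunks))
t_out.start(); t_err.start()
t_out.join(); t_err.join()
errcode = p.wait()  # safe: both pipes are already drained
```

Each pipe is always being read, so neither side can fill its OS buffer and stall the child.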

The suggestion of using temporary files may or may not work for you, depending on the size of the output, etc., and whether you need to process the subprocess's output as it is generated.

As Heikki Toivonen has suggested, you should look at the communicate method. However, this buffers the stdout/stderr of the subprocess in memory, and you get those returned from the communicate call; this is not ideal for some scenarios. But the source of the communicate method is worth looking at.

Another example is in a package I maintain, python-gnupg, where the gpg executable is spawned via subprocess to do the heavy lifting, and the Python wrapper spawns threads to read gpg's stdout and stderr and consume them as the data is produced by gpg. You may be able to get some ideas by looking at the source there, as well. The data produced by gpg to both stdout and stderr can be quite large, in the general case.

I had the same problem. If you have to handle a large output, another good option is to use files for stdout and stderr, and pass those file objects as parameters.

Check the tempfile module in Python: https://docs.python.org/2/library/tempfile.html .

Something like this might work:

out = tempfile.NamedTemporaryFile(delete=False)

Then you would do:

Popen(... stdout=out,...)

Then you can read the file, and erase it later.
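Putting those pieces together, the whole round trip might look like the sketch below; the child command is a stand-in for your real program:

```python
import os
import subprocess
import sys
import tempfile

# Send the child's stdout to a named temporary file instead of a pipe,
# so no OS pipe buffer can fill up, no matter how much output there is.
out = tempfile.NamedTemporaryFile(delete=False)
try:
    p = subprocess.Popen([sys.executable, '-c', 'print("payload")'],
                         stdout=out)
    errcode = p.wait()  # no deadlock risk: the kernel writes to the file
    out.close()
    with open(out.name) as f:
        data = f.read()
finally:
    os.unlink(out.name)  # erase the temporary file once we are done
```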

You could try communicate and see if that solves your problem. If not, I'd redirect the output to a temporary file.

Here is a simple approach which captures both regular output and error output, all within Python, so the limitations of stdout don't apply:

com_str = 'uname -a'
command = subprocess.Popen(com_str, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print(output)

Linux 3.11.0-20-generic SMP Fri May 2 21:32:55 UTC 2014 

and

com_str = 'id'
command = subprocess.Popen(com_str, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE, shell=True)
(output, error) = command.communicate()
print(output)

uid=1000(myname) gid=1000(mygrp) groups=1000(cell),0(root)
