
Why is subprocess.run output different from shell output of same command?

I am using subprocess.run() for some automated testing. Mostly to automate doing:

dummy.exe < file.txt > foo.txt
diff file.txt foo.txt

If you execute the above redirection in a shell, the two files are always identical. But whenever file.txt is too long, the below Python code does not return the correct result.

This is the Python code:

import subprocess
import sys


def main(argv):

    exe_path = r'dummy.exe'
    file_path = r'file.txt'

    with open(file_path, 'r') as test_file:
        stdin = test_file.read().strip()
        p = subprocess.run([exe_path], input=stdin, stdout=subprocess.PIPE, universal_newlines=True)
        out = p.stdout.strip()
        err = p.stderr
        if stdin == out:
            print('OK')
        else:
            print('failed: ' + out)

if __name__ == "__main__":
    main(sys.argv[1:])

Here is the C++ code in dummy.cc:

#include <iostream>


int main()
{
    int size, count, a, b;
    std::cin >> size;
    std::cin >> count;

    std::cout << size << " " << count << std::endl;


    for (int i = 0; i < count; ++i)
    {
        std::cin >> a >> b;
        std::cout << a << " " << b << std::endl;
    }
}

file.txt can be anything like this:

1 100000
0 417
0 842
0 919
...

The second integer on the first line is the number of lines following, hence here file.txt will be 100,001 lines long.
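For reference, an input file in this format can be generated with a small script. The sketch below is a hypothetical helper (the exact value ranges are made up for illustration; only the "size count" header plus count lines of two integers matches the description above):

```python
# Hypothetical helper: generate an input file in the format described
# above: a "size count" header line followed by `count` lines of two
# integers.  The value ranges are invented for illustration.
import random

def write_test_input(path, size=1, count=100000, seed=0):
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    with open(path, 'w', newline='\n') as f:  # force LF line endings
        f.write('{} {}\n'.format(size, count))
        for _ in range(count):
            f.write('0 {}\n'.format(rng.randint(0, 999)))

write_test_input('file.txt', count=100000)
```

This makes it easy to vary the file length and find the threshold at which the Python-driven run starts to diverge from the shell run.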

Question: Am I misusing subprocess.run()?

Edit

My exact Python code, after the comments (newline handling, rb mode) were taken into account:

import subprocess
import sys
import os


def main(argv):

    base_dir = os.path.dirname(__file__)
    exe_path = os.path.join(base_dir, 'dummy.exe')
    file_path = os.path.join(base_dir, 'infile.txt')
    out_path = os.path.join(base_dir, 'outfile.txt')

    with open(file_path, 'rb') as test_file:
        stdin = test_file.read().strip()
        p = subprocess.run([exe_path], input=stdin, stdout=subprocess.PIPE)
        out = p.stdout.strip()
        if stdin == out:
            print('OK')
        else:
            with open(out_path, "wb") as text_file:
                text_file.write(out)

if __name__ == "__main__":
    main(sys.argv[1:])

Here is the first diff:

(screenshot of the diff from the original post, not reproduced here)

Here is the input file: https://drive.google.com/open?id=0B--mU_EsNUGTR3VKaktvQVNtLTQ

To reproduce, the shell command:

subprocess.run("dummy.exe < file.txt > foo.txt", shell=True, check=True)

And without the shell, in pure Python:

with open('file.txt', 'rb', 0) as input_file, \
     open('foo.txt', 'wb', 0) as output_file:
    subprocess.run(["dummy.exe"], stdin=input_file, stdout=output_file, check=True)

It works with arbitrarily large files.

You could use subprocess.check_call() in this case (available since Python 2), instead of subprocess.run(), which is only available in Python 3.5+.
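A minimal sketch of that check_call() variant, keeping the same file-to-file redirection. Here a Python one-liner that copies stdin to stdout stands in for dummy.exe, so the snippet is self-contained; in the real test you would pass ['dummy.exe'] instead:

```python
# Sketch: the same redirection with subprocess.check_call(), which also
# exists on Python 2.  The Python one-liner below echoes stdin to stdout
# and stands in for dummy.exe; swap in ['dummy.exe'] for the real test.
import subprocess
import sys

cat_cmd = [sys.executable, '-c',
           'import sys, shutil; '
           'shutil.copyfileobj(sys.stdin.buffer, sys.stdout.buffer)']

with open('file.txt', 'wb') as f:  # tiny stand-in input
    f.write(b'1 2\n0 417\n0 842\n')

with open('file.txt', 'rb', 0) as input_file, \
     open('foo.txt', 'wb', 0) as output_file:
    subprocess.check_call(cat_cmd, stdin=input_file, stdout=output_file)
```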

Works very well, thanks. But then why was the original failing? Pipe buffer size, as in Kevin's answer?

It has nothing to do with OS pipe buffers. The warning from the subprocess docs that @Kevin J. Chase cites is unrelated to subprocess.run(). You should care about OS pipe buffers only if you use process = Popen() and manually read()/write() via multiple pipe streams (process.stdin/.stdout/.stderr).

It turns out that the observed behavior is due to a Windows bug in the Universal CRT. Here's the same issue reproduced without Python: Why would redirection work where piping fails?

As said in the bug description, to work around it:

  • "use a binary pipe and do text mode CRLF => LF translation manually on the reader side" or use ReadFile() directly instead of std::cin
  • or wait for the Windows 10 update this summer (where the bug should be fixed)
  • or use a different C++ compiler, e.g., there is no issue if you use g++ on Windows

The bug affects only text pipes, i.e., the code that uses < and > redirection should be fine (stdin=input_file, stdout=output_file should still work, or it is some other bug).
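The first workaround can be sketched in Python terms: capture the child's stdout as a binary pipe (no universal_newlines) and do the CRLF => LF translation yourself on the reader side, bypassing the buggy text-mode pipe. The echoing child below is a stand-in for dummy.exe:

```python
# Sketch of the "binary pipe + manual CRLF => LF translation" workaround:
# capture raw bytes and normalize line endings on the Python side.
import subprocess
import sys

def run_binary(cmd, input_bytes):
    p = subprocess.run(cmd, input=input_bytes, stdout=subprocess.PIPE)
    return p.stdout.replace(b'\r\n', b'\n')  # reader-side newline translation

# Stand-in child that echoes stdin verbatim (replace with ['dummy.exe']):
echo_cmd = [sys.executable, '-c',
            'import sys, shutil; '
            'shutil.copyfileobj(sys.stdin.buffer, sys.stdout.buffer)']

out = run_binary(echo_cmd, b'1 2\r\n0 417\r\n')
```

After normalization, `out` can be compared against the LF-normalized input regardless of how the child (or the CRT) mangled line endings.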

I'll start with a disclaimer: I don't have Python 3.5 (so I can't use the run function), and I wasn't able to reproduce your problem on Windows (Python 3.4.4) or Linux (3.1.6). That said...

Problems with subprocess.PIPE and Family

The subprocess.run docs say that it's just a front-end for the old subprocess.Popen-and-communicate() technique. The subprocess.Popen.communicate docs warn that:

The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

This sure sounds like your problem. Unfortunately, the docs don't say how much data is "large", nor what will happen after "too much" data is read. Just "don't do that, then".

The docs for subprocess.call go into a little more detail (emphasis mine)...

Do not use stdout=PIPE or stderr=PIPE with this function. The child process will block if it generates enough output to a pipe to fill up the OS pipe buffer as the pipes are not being read from.

...as do the docs for subprocess.Popen.wait:

This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

That sure sounds like Popen.communicate is the solution to this problem, but communicate's own docs say "do not use this method if the data size is large", which is exactly the situation where the wait docs tell you to use communicate. (Maybe it "avoid(s) that" by silently dropping data on the floor?)

Frustratingly, I don't see any way to use a subprocess.PIPE safely, unless you're sure you can read from it faster than your child process writes to it.
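One pattern that keeps that guarantee is to drain the pipe incrementally while the child runs, instead of waiting first and reading later, so the OS pipe buffer can never fill up. A sketch, using a noisy stand-in child that just prints 100,000 lines (this only covers the case where the child needs no large stdin of its own):

```python
# Sketch: read the child's stdout as it is produced, so the OS pipe
# buffer is continuously emptied and the child never blocks on write.
import subprocess
import sys

child_cmd = [sys.executable, '-c',
             'import sys\n'
             'for i in range(100000): sys.stdout.write("%d\\n" % i)']

p = subprocess.Popen(child_cmd, stdout=subprocess.PIPE,
                     universal_newlines=True)
lines = [line.rstrip('\n') for line in p.stdout]  # reads while child writes
p.stdout.close()
returncode = p.wait()
```

Iterating over p.stdout consumes output line by line, so memory use stays bounded by one line at a time rather than the whole output.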

On that note...

Alternative: tempfile.TemporaryFile

You're holding all your data in memory... twice, in fact. That can't be efficient, especially if it's already in a file.

If you're allowed to use a temporary file, you can compare the two files very easily, one line at a time. This avoids all the subprocess.PIPE mess, and it's much faster, because it only uses a little bit of RAM at a time. (The IO from your subprocess might be faster, too, depending on how your operating system handles output redirection.)

Again, I can't test run, so here's a slightly older Popen-and-communicate solution (minus main and the rest of your setup):

import io
import subprocess
import tempfile

def are_text_files_equal(file0, file1):
    '''
    Both files must be opened in "update" mode ('+' character), so
    they can be rewound to their beginnings.  Both files will be read
    until just past the first differing line, or to the end of the
    files if no differences were encountered.
    '''
    file0.seek(0, io.SEEK_SET)
    file1.seek(0, io.SEEK_SET)
    for line0, line1 in zip(file0, file1):
        if line0 != line1:
            return False
    # Both files were identical to this point.  See if either file
    # has more data.
    next0 = next(file0, '')
    next1 = next(file1, '')
    if next0 or next1:
        return False
    return True

def compare_subprocess_output(exe_path, input_path):
    with tempfile.TemporaryFile(mode='w+t', encoding='utf8') as temp_file:
        with open(input_path, 'r+t') as input_file:
            p = subprocess.Popen(
              [exe_path],
              stdin=input_file,
              stdout=temp_file,  # No more PIPE.
              stderr=subprocess.PIPE,  # <sigh>
              universal_newlines=True,
              )
            err = p.communicate()[1]  # No need to store output.
            # Compare input and output files...  This must be inside
            # the `with` block, or the TemporaryFile will close before
            # we can use it.
            if are_text_files_equal(temp_file, input_file):
                print('OK')
            else:
                print('Failed: ' + str(err))
    return

Unfortunately, since I can't reproduce your problem, even with a million-line input, I can't tell if this works. If nothing else, it ought to give you wrong answers faster.

Variant: Regular File

If you want to keep the output of your test run in foo.txt (from your command-line example), then you would direct your subprocess's output to a normal file instead of a TemporaryFile. This is the solution recommended in J.F. Sebastian's answer.

I can't tell from your question if you wanted foo.txt, or if it was just a side-effect of the two-step test-then-diff: your command-line example saves test output to a file, while your Python script doesn't. Saving the output would be handy if you ever want to investigate a test failure, but it requires coming up with a unique filename for each test you run, so they don't overwrite each other's output.
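One way to get those unique filenames is tempfile.mkstemp(), which reserves a fresh name atomically. The helper and prefix below are just one possible convention, not anything from the original code:

```python
# Sketch: reserve a unique output file per test run, so repeated or
# parallel runs never clobber each other's foo.txt.
import os
import tempfile

def unique_output_path(base_dir='.', prefix='test-output-'):
    fd, path = tempfile.mkstemp(prefix=prefix, suffix='.txt', dir=base_dir)
    os.close(fd)  # we only need the reserved name, not the open handle
    return path

out_a = unique_output_path()
out_b = unique_output_path()
```

Each call creates (and leaves behind) an empty, uniquely named file that the subprocess can then be redirected into.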
