pipe large amount of data to stdin while using subprocess.Popen

I'm kind of struggling to understand what the Python way of solving this simple problem is.

My problem is quite simple. If you use the following code it will hang. This is well documented in the subprocess module docs.

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
for i in range(100000):
    proc.stdin.write('%d\n' % i)
output = proc.communicate()[0]
print output

Searching for a solution (there was a very insightful thread, but I've lost it now) I found this solution (among others) that uses an explicit fork:

import os
import sys
from subprocess import Popen, PIPE

def produce(to_sed):
    for i in range(100000):
        to_sed.write("%d\n" % i)
        to_sed.flush()
    #this would happen implicitly, anyway, but is here for the example
    to_sed.close()

def consume(from_sed):
    while 1:
        res = from_sed.readline()
        if not res:
            sys.exit(0)
            #sys.exit(proc.poll())
        print 'received: ', [res]

def main():
    proc = Popen(['cat','-'],stdin=PIPE,stdout=PIPE)
    to_sed = proc.stdin
    from_sed = proc.stdout

    pid = os.fork()
    if pid == 0 :
        from_sed.close()
        produce(to_sed)
        return
    else :
        to_sed.close()
        consume(from_sed)

if __name__ == '__main__':
    main()

While this solution is conceptually very easy to understand, it uses one more process and feels too low-level compared to the subprocess module (which exists precisely to hide this kind of thing...).

I'm wondering: is there a simple and clean solution using the subprocess module that won't hang, or do I have to take a step back and implement an old-style select loop or an explicit fork to get this pattern?

Thanks

If you want a pure Python solution, you need to put either the reader or the writer in a separate thread. The threading package is a lightweight way to do this, with convenient access to common objects and no messy forking.

import subprocess
import threading
import sys

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
def writer():
    for i in range(100000):
        proc.stdin.write(b'%d\n' % i)
    proc.stdin.close()
thread = threading.Thread(target=writer)
thread.start()
for line in proc.stdout:
    sys.stdout.write(line.decode())
thread.join()
proc.wait()

It might be neat to see the subprocess module modernized to support streams and coroutines, which would allow pipelines that mix Python pieces and shell pieces to be constructed more elegantly.

If you don't want to keep all the data in memory, you have to use select. E.g. something like:

import subprocess
from select import select
import os

proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

i = 0
while True:
    rlist, wlist, xlist = [proc.stdout], [], []
    if i < 100000:
        wlist.append(proc.stdin)
    rlist, wlist, xlist = select(rlist, wlist, xlist)
    if proc.stdout in rlist:
        out = os.read(proc.stdout.fileno(), 10)
        print out,
        if not out:
            break
    if proc.stdin in wlist:
        proc.stdin.write('%d\n' % i)
        i += 1
        if i >= 100000:
            proc.stdin.close()

Your code deadlocks as soon as cat's stdout OS pipe buffer is full. If you use stdout=PIPE, you have to consume the output in time, otherwise a deadlock like the one in your case may happen.
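
If you never need the child's output at all, the simplest way to avoid this particular deadlock is to discard it instead of piping it. A minimal sketch, assuming Python 3 (subprocess.DEVNULL needs 3.3+):

#!/usr/bin/env python3
import subprocess

# Discard the child's stdout so its pipe buffer can never fill up.
with subprocess.Popen(['cat'], stdin=subprocess.PIPE,
                      stdout=subprocess.DEVNULL,
                      universal_newlines=True) as process:
    for i in range(100000):
        print(i, file=process.stdin)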

If you don't need the output while the process is running, you could redirect it to a temporary file:

#!/usr/bin/env python3
import subprocess
import tempfile

with tempfile.TemporaryFile('r+') as output_file:
    with subprocess.Popen(['cat'],
                          stdin=subprocess.PIPE,
                          stdout=output_file,
                          universal_newlines=True) as process:
        for i in range(100000):
            print(i, file=process.stdin)
    output_file.seek(0)  # rewind (and sync with the disk)
    print(output_file.readline(), end='')  # get the first line of the output

If the input/output is small (fits in memory), you could pass the input all at once and get the output all at once using .communicate(), which reads/writes concurrently for you:

#!/usr/bin/env python3
import subprocess

cp = subprocess.run(['cat'], input='\n'.join(['%d' % i for i in range(100000)]),
                    stdout=subprocess.PIPE, universal_newlines=True)
print(cp.stdout.splitlines()[-1]) # print the last line

To read/write concurrently manually, you could use threads, asyncio, fcntl, etc. @Jed provided a simple thread-based solution. Here's an asyncio-based solution:

#!/usr/bin/env python3
import asyncio
import sys
from subprocess import PIPE

async def pump_input(writer):
    try:
        for i in range(100000):
            writer.write(b'%d\n' % i)
            await writer.drain()
    finally:
        writer.close()

async def run():
    # start child process
    # NOTE: universal_newlines parameter is not supported
    process = await asyncio.create_subprocess_exec('cat', stdin=PIPE, stdout=PIPE)
    asyncio.ensure_future(pump_input(process.stdin)) # write input
    async for line in process.stdout: # consume output
        print(int(line)**2) # print squares
    return await process.wait()  # wait for the child process to exit


if sys.platform.startswith('win'):
    loop = asyncio.ProactorEventLoop() # for subprocess' pipes on Windows
    asyncio.set_event_loop(loop)
else:
    loop = asyncio.get_event_loop()
loop.run_until_complete(run())
loop.close()

On Unix, you could use an fcntl-based solution:

#!/usr/bin/env python3
import sys
from fcntl import fcntl, F_GETFL, F_SETFL
from os import O_NONBLOCK
from shutil import copyfileobj
from subprocess import Popen, PIPE, _PIPE_BUF as PIPE_BUF

def make_blocking(pipe, blocking=True):
    fd = pipe.fileno()
    if not blocking:
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) # set O_NONBLOCK
    else:
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_NONBLOCK) # clear it


with Popen(['cat'], stdin=PIPE, stdout=PIPE) as process:
    make_blocking(process.stdout, blocking=False)
    with process.stdin:
        for i in range(100000):
            #NOTE: the mode is block-buffered (default) and therefore
            # `cat` won't see it immediately
            process.stdin.write(b'%d\n' % i)
            # a deadlock may happen here with a *blocking* pipe
            output = process.stdout.read(PIPE_BUF)
            if output is not None:
                sys.stdout.buffer.write(output)
    # read the rest
    make_blocking(process.stdout)
    copyfileobj(process.stdout, sys.stdout.buffer)

Here's something I used to load a 6G MySQL dump file via subprocess. Stay away from shell=True: it's not secure, and it starts an extra process and wastes resources.

import subprocess

fhandle = None

cmd = [mysql_path,
       "-u", mysql_user, "-p" + mysql_pass,
       "-h", host, database]

fhandle = open(dump_file, 'r')
p = subprocess.Popen(cmd, stdin=fhandle, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

(stdout,stderr) = p.communicate()

fhandle.close()

Using aiofiles & asyncio in Python 3.5:

It's a bit complicated, but you need only 1024 bytes of memory to write to stdin!

import asyncio
import aiofiles
import sys
from os.path import dirname, join, abspath
import subprocess as sb


THIS_DIR = abspath(dirname(__file__))
SAMPLE_FILE = join(THIS_DIR, '../src/hazelnut/tests/stuff/sample.mp4')
DEST_PATH = '/home/vahid/Desktop/sample.mp4'


async def async_file_reader(f, buffer):
    async for l in f:
        if l:
            # decode the bytes lines so they can be joined into a str later
            buffer.append(l.decode().rstrip('\n'))
        else:
            break
    print('reader done')


async def async_file_writer(source_file, target_file):
    length = 0
    while True:
        input_chunk = await source_file.read(1024)
        if input_chunk:
            length += len(input_chunk)
            target_file.write(input_chunk)
            await target_file.drain()
        else:
            target_file.write_eof()
            break

    print('writer done: %s' % length)


async def main():
    dir_name = dirname(DEST_PATH)
    remote_cmd = 'ssh localhost mkdir -p %s && cat - > %s' % (dir_name, DEST_PATH)

    stdout, stderr = [], []
    async with aiofiles.open(SAMPLE_FILE, mode='rb') as f:
        cmd = await asyncio.create_subprocess_shell(
            remote_cmd,
            stdin=sb.PIPE,
            stdout=sb.PIPE,
            stderr=sb.PIPE,
        )

        await asyncio.gather(*(
            async_file_reader(cmd.stdout, stdout),
            async_file_reader(cmd.stderr, stderr),
            async_file_writer(f, cmd.stdin)
        ))

        print('EXIT STATUS: %s' % await cmd.wait())

    stdout, stderr = '\n'.join(stdout), '\n'.join(stderr)

    if stdout:
        print(stdout)

    if stderr:
        print(stderr, file=sys.stderr)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Result:

writer done: 383631
reader done
reader done
EXIT STATUS: 0

For this kind of thing, the shell works a lot better than subprocess.

Write very simple Python apps which read from sys.stdin and write to sys.stdout.

Connect the simple apps together using a shell pipeline.

If you want, start the pipeline using subprocess (see the sketch below) or simply write a one-line shell script.

python part1.py | python part2.py

This is very, very efficient. It's also portable to all Linux (and Windows) systems as long as you keep it very simple.
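
If you do launch the pipeline from Python, a minimal sketch with two connected Popen objects (part1.py / part2.py are just the placeholder names from above):

from subprocess import Popen, PIPE

# The OS feeds part1.py's stdout directly into part2.py's stdin.
p1 = Popen(['python', 'part1.py'], stdout=PIPE)
p2 = Popen(['python', 'part2.py'], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()  # let part1 receive SIGPIPE if part2 exits early
output = p2.communicate()[0]
print(output.decode(), end='')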

The simplest solution I can think of:

from subprocess import Popen, PIPE
from threading import Thread

s = ''.join(map(str, xrange(10000))) # a large string
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=1)
Thread(target=lambda: any((p.stdin.write(b) for b in s)) or p.stdin.close()).start()
print (p.stdout.read())

Buffered version:

from subprocess import Popen, PIPE
from threading import Thread

s = ''.join(map(str, xrange(10000))) # a large string
n = 1024 # buffer size
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=n)
Thread(target=lambda: any((p.stdin.write(c) for c in (s[i:i+n] for i in xrange(0, len(s), n)))) or p.stdin.close()).start()
print (p.stdout.read())

I was looking for example code that iterates over a process's output incrementally while the process consumes its input from a provided iterator (also incrementally). Basically:

import string
import random

# That's what I consider a very useful function, though didn't
# find any existing implementations.
def process_line_reader(args, stdin_lines):
    # args - command to run, same as subprocess.Popen
    # stdin_lines - iterable with lines to send to process stdin
    # returns - iterable with lines received from process stdout
    pass

# Returns iterable over n random strings. n is assumed to be infinity if negative.
# Just an example of function that returns potentially unlimited number of lines.
def random_lines(n, M=8):
    while 0 != n:
        yield "".join(random.choice(string.letters) for _ in range(M))
        if 0 < n:
            n -= 1

# That's what I consider to be a very convenient use case for
# function proposed above.
def print_many_uniq_numbered_random_lines():
    i = 0
    for line in process_line_reader(["uniq", "-i"], random_lines(100500 * 9000)):
        # Key idea here is that `process_line_reader` will feed random lines into
        # `uniq` process stdin as lines are consumed from returned iterable.
        print "#%i: %s" % (i, line)
        i += 1

Some of the solutions suggested here do it with threads (but that's not always convenient) or with asyncio (which is not available in Python 2.x). Below is an example of a working implementation that allows it.

import subprocess
import os
import fcntl
import select
import errno

class nonblocking_io(object):
    def __init__(self, f):
        self._fd = -1
        if type(f) is int:
            self._fd = os.dup(f)
            os.close(f)
        elif type(f) is file:
            self._fd = os.dup(f.fileno())
            f.close()
        else:
            raise TypeError("Only accept file objects or integer file descriptors")
        flag = fcntl.fcntl(self._fd, fcntl.F_GETFL)
        fcntl.fcntl(self._fd, fcntl.F_SETFL, flag | os.O_NONBLOCK)
    def __enter__(self):
        return self
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def fileno(self):
        return self._fd
    def close(self):
        if 0 <= self._fd:
            os.close(self._fd)
            self._fd = -1

class nonblocking_line_writer(nonblocking_io):
    def __init__(self, f, lines, autoclose=True, buffer_size=16*1024, encoding="utf-8", linesep=os.linesep):
        super(nonblocking_line_writer, self).__init__(f)
        self._lines = iter(lines)
        self._lines_ended = False
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._buffer_offset = 0
        self._buffer = bytearray()
        self._encoding = encoding
        self._linesep = bytearray(linesep, encoding)
    # Returns False when `lines` iterable is exhausted and all pending data is written
    def continue_writing(self):
        while True:
            if self._buffer_offset < len(self._buffer):
                n = os.write(self._fd, self._buffer[self._buffer_offset:])
                self._buffer_offset += n
                if self._buffer_offset < len(self._buffer):
                    return True
            if self._lines_ended:
                if self._autoclose:
                    self.close()
                return False
            self._buffer[:] = []
            self._buffer_offset = 0
            while len(self._buffer) < self._buffer_size:
                line = next(self._lines, None)
                if line is None:
                    self._lines_ended = True
                    break
                self._buffer.extend(bytearray(line, self._encoding))
                self._buffer.extend(self._linesep)

class nonblocking_line_reader(nonblocking_io):
    def __init__(self, f, autoclose=True, buffer_size=16*1024, encoding="utf-8"):
        super(nonblocking_line_reader, self).__init__(f)
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._encoding = encoding
        self._file_ended = False
        self._line_part = ""
    # Returns (lines, more) tuple, where lines is iterable with lines read and more will
    # be set to False after EOF.
    def continue_reading(self):
        lines = []
        while not self._file_ended:
            data = os.read(self._fd, self._buffer_size)
            if 0 == len(data):
                self._file_ended = True
                if self._autoclose:
                    self.close()
                if 0 < len(self._line_part):
                    lines.append(self._line_part.decode(self._encoding))
                    self._line_part = ""
                break
            for line in data.splitlines(True):
                self._line_part += line
                if self._line_part.endswith(("\n", "\r")):
                    lines.append(self._line_part.decode(self._encoding).rstrip("\n\r"))
                    self._line_part = ""
            if len(data) < self._buffer_size:
                break
        return (lines, not self._file_ended)

class process_line_reader(object):
    def __init__(self, args, stdin_lines):
        self._p = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        self._reader = nonblocking_line_reader(self._p.stdout)
        self._writer = nonblocking_line_writer(self._p.stdin, stdin_lines)
        self._iterator = self._communicate()
    def __iter__(self):
        return self._iterator
    def __enter__(self):
        return self._iterator
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def _communicate(self):
        read_set = [self._reader]
        write_set = [self._writer]
        while read_set or write_set:
            try:
                rlist, wlist, xlist = select.select(read_set, write_set, [])
            except select.error, e:
                if e.args[0] == errno.EINTR:
                    continue
                raise
            if self._reader in rlist:
                stdout_lines, more = self._reader.continue_reading()
                for line in stdout_lines:
                    yield line
                if not more:
                    read_set.remove(self._reader)
            if self._writer in wlist:
                if not self._writer.continue_writing():
                    write_set.remove(self._writer)
        self.close()
    def lines(self):
        return self._iterator
    def close(self):
        if self._iterator is not None:
            self._reader.close()
            self._writer.close()
            self._p.wait()
            self._iterator = None

Here is an example (Python 3) of reading one record at a time from gzip using a pipe:

from subprocess import Popen, PIPE

cmd = ['gzip', '-dc', 'compressed_file.gz']
pipe = Popen(cmd, stdout=PIPE).stdout

for line in pipe:
    print(":", line.decode(), end="")

I know there is a standard module for that; it is just meant as an example. You can read the whole output in one go (like shell back-ticks) using the communicate method, but obviously you have to be careful of memory size.
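
For comparison, a minimal sketch of the whole-output variant using communicate (same placeholder filename as above):

from subprocess import Popen, PIPE

proc = Popen(['gzip', '-dc', 'compressed_file.gz'], stdout=PIPE)
output, _ = proc.communicate()  # reads the entire decompressed output into memory
print(output.decode(), end='')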

Here is an example (Python 3 again) of writing records to the lp(1) program on Linux:

cmd = ['lp', '-']
proc = Popen(cmd, stdin=PIPE)
proc.communicate(some_data.encode())

Now I know this is not going to satisfy the purist in you completely, as the input will have to fit in memory and you have no option to work interactively with input-output, but at least this works fine for your example. The communicate method optionally takes the input as an argument, and if you feed your process its input this way, it will work.

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )

input = "".join('{0:d}\n'.format(i) for i in range(100000))
output = proc.communicate(input)[0]
print output

As for the larger problem, you can subclass Popen, rewrite __init__ to accept stream-like objects as arguments for stdin, stdout and stderr, and rewrite the _communicate method (hairy to do cross-platform, you need to do it twice, see the subprocess.py source) to call read() on the stdin stream and write() the output to the stdout and stderr streams. What bothers me about this approach is that, as far as I know, it hasn't already been done. When obvious things have not been done before there's usually a reason (it doesn't work as intended), but I can't see why it shouldn't here, apart from the fact that you need the streams to be thread-safe on Windows.
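
A rough sketch of that idea that stays away from Popen internals: a thin subclass that accepts file-like objects and pumps them with a helper thread. The class and method names here are made up for illustration, and this is not the _communicate rewrite described above, only an approximation of the interface it would give you:

import io
import shutil
import subprocess
import threading

class StreamingPopen(subprocess.Popen):
    # Hypothetical wrapper: pump data between file-like objects and the child's pipes.
    def pump(self, source, sink):
        def feed():
            # copy everything from `source` into the child's stdin, then close it
            shutil.copyfileobj(source, self.stdin)
            self.stdin.close()
        t = threading.Thread(target=feed)
        t.start()
        # concurrently copy the child's stdout into `sink`
        shutil.copyfileobj(self.stdout, sink)
        t.join()
        return self.wait()

# usage sketch
source = io.BytesIO(b''.join(b'%d\n' % i for i in range(100000)))
sink = io.BytesIO()
proc = StreamingPopen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
proc.pump(source, sink)
print(sink.getvalue().splitlines()[-1])  # b'99999'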
