How to join the stdout of two subprocesses and pipe to stdin of a new subprocess in Python

Let's say I have the following command running from the shell:

{ 
samtools view -HS header.sam;           # command1
samtools view input.bam 1:1-50000000;   # command2
} | samtools view -bS - > output.bam    # command3

For those of you who aren't familiar with samtools view (since this is Stack Overflow): what this is essentially doing is creating a new BAM file with a new header. BAM files are typically large compressed files, so even a single pass through one can be time-consuming. One alternative approach would be to run command2 and then use samtools reheader to switch the header, but that passes through the large file twice. The command above passes through the BAM a single time, which matters for larger BAM files (they can grow beyond 20GB even when compressed, e.g. for whole-genome sequencing data).
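For reference, the two-pass alternative would look something like this in the same subprocess style (a sketch, assuming a samtools version whose reheader subcommand writes the reheadered BAM to stdout):

from subprocess import check_call

# pass 1: extract the region as BAM (first full pass through input.bam)
with open('tmp.bam', 'wb') as tmp:
    check_call(["samtools", "view", "-b", "input.bam", "1:1-50000000"], stdout=tmp)

# pass 2: swap in the new header (reads tmp.bam in full a second time)
with open('output.bam', 'wb') as out:
    check_call(["samtools", "reheader", "header.sam", "tmp.bam"], stdout=out)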

My question is how to implement commands of this type in python using subprocess.

I have the following:

import subprocess

fh_bam = open('output.bam', 'wb')  # BAM is binary, so open in binary mode
params_0 = [ "samtools", "view", "-HS", "header.sam" ]
params_1 = [ "samtools", "view", "input.bam", "1:1-50000000"]
params_2 = [ "samtools", "view", "-bS", "-" ]
sub_0 = subprocess.Popen(params_0, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
sub_1 = subprocess.Popen(params_1, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
### SOMEHOW APPEND sub_1.stdout to sub_0.stdout
sub_2 = subprocess.Popen(params_2, stdin=appended.stdout, stdout=fh_bam)

Any help is greatly appreciated. Thanks.

If you already have the shell command as a string, you could just run it as is:

#!/usr/bin/env python
from subprocess import check_call

check_call(r"""
{ 
samtools view -HS header.sam;           # command1
samtools view input.bam 1:1-50000000;   # command2
} | samtools view -bS - > output.bam    # command3
""", shell=True)

To emulate the pipeline in Python:

#!/usr/bin/env python
from subprocess import Popen, PIPE

# start command3 to get stdin pipe, redirect output to the file
with open('output.bam', 'wb', 0) as output_file:
    command3 = Popen("samtools view -bS -".split(), 
                     stdin=PIPE, stdout=output_file)
# start command1 with its stdout redirected to command3 stdin
command1 = Popen('samtools view -HS header.sam'.split(),
                 stdout=command3.stdin)
rc_command1 = command1.wait() #NOTE: command3.stdin is not closed, no SIGPIPE or a write error if command3 dies
# start command2 after command1 finishes
command2 = Popen('samtools view input.bam 1:1-50000000'.split(),
                 stdout=command3.stdin)
command3.stdin.close() # inform command2 if command3 dies (SIGPIPE or a write error)
rc_command2 = command2.wait()
rc_command3 = command3.wait()
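To fail loudly when any stage breaks (similar in spirit to bash's set -o pipefail), a minimal check on the collected exit statuses could follow, reusing the variables from the snippet above:

for name, rc in [('command1', rc_command1),
                 ('command2', rc_command2),
                 ('command3', rc_command3)]:
    if rc != 0:
        raise RuntimeError('%s exited with non-zero status %d' % (name, rc))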

(I can't comment, sadly, but this 'answer' is really a comment on cmidi's answer; if somebody can move it, that would be appreciated! -- PS: that answer has since been deleted...)

Marco explicitly said that the commands produce a lot of output, around 20GB. If you use communicate(), it will wait for the process to terminate, which means your process has to hold that entire amount of data. In practice the OS will page the data out to disk in the meantime, unless your computer has more than 20GB of free RAM. So you end up writing the intermediate data to disk, which the original author wanted to avoid. +1 for sirlark's answer!
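To make the problem concrete, this is the pattern being warned against (a minimal sketch; params_1 as defined in the question):

import subprocess

params_1 = ["samtools", "view", "input.bam", "1:1-50000000"]
sub = subprocess.Popen(params_1, stdout=subprocess.PIPE)
out, _ = sub.communicate()  # blocks until exit and buffers the whole ~20GB stream in memory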

I assume that concatenating the outputs of the first two subprocesses in memory is not feasible due to the size of the files involved. I'd suggest wrapping the outputs of the first two subprocesses in a file-like object. It looks like you'll only need the read method, since Popen will only ever read from its stdin file-like object, never seek or write. The code below assumes that an empty string returned from read is sufficient to indicate that the stream is at EOF.

class concat(object):
    # A file-like object that reads f1 to exhaustion, then reads from f2.
    def __init__(self, f1, f2):
        self.f1 = f1
        self.f2 = f2

    def read(self, *args):
        ret = self.f1.read(*args)
        if not ret:  # empty read: f1 is at EOF, fall through to f2
            ret = self.f2.read(*args)
        return ret

import subprocess

fh_bam = open('output.bam', 'wb')  # BAM is binary, so open in binary mode
params_0 = [ "samtools", "view", "-HS", "header.sam" ]
params_1 = [ "samtools", "view", "input.bam", "1:1-50000000"]
params_2 = [ "samtools", "view", "-bS", "-" ]
sub_0 = subprocess.Popen(params_0, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
sub_1 = subprocess.Popen(params_1, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
### Concatenate sub_0.stdout and sub_1.stdout via the wrapper above
sub_2 = subprocess.Popen(params_2, stdin=concat(sub_0.stdout, sub_1.stdout), stdout=fh_bam)

To clarify, f1.read will block, and only returns '' when the pipe is closed/at EOF. concat.read only tries to read from f2 after this happens, so output from f1 and f2 won't be interleaved. There is of course a slight overhead of repeatedly re-reading the exhausted f1, which could be averted by setting a flag variable to indicate which file to read from (see the sketch below). I doubt it would improve performance significantly though.
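For completeness, that flag-variable variant would look something like this (the concat2 name is just for illustration; same assumptions as the class above):

class concat2(object):
    def __init__(self, f1, f2):
        self.f1 = f1
        self.f2 = f2
        self.f1_done = False  # set once f1 returns an empty read

    def read(self, *args):
        if not self.f1_done:
            ret = self.f1.read(*args)
            if ret:
                return ret
            self.f1_done = True  # f1 hit EOF; only read f2 from now on
        return self.f2.read(*args)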

While Popen accepts file-like objects, it actually uses the underlying file handles/descriptors to communicate, not the read and write methods of the file objects, as @J.F. Sebastian rightly points out, so the concat wrapper above won't work as stdin. A better way to do this is to use a pipe (os.pipe()), which doesn't touch the disk. This allows you to connect the output stream directly to the input stream of another process, which is exactly what you want. The problem is then just a matter of serialisation: making sure the two source streams don't interleave.

import os
import subprocess

r, w = os.pipe()

fh_bam = open('output.bam', 'wb')
params_0 = [ "samtools", "view", "-HS", "header.sam" ]
params_1 = [ "samtools", "view", "input.bam", "1:1-50000000"]
params_2 = [ "samtools", "view", "-bS", "-" ]
sub_sink = subprocess.Popen(params_2, stdin=r, stdout=fh_bam, bufsize=4096,
                            close_fds=True)  # don't let the sink inherit w, or it would never see EOF
os.close(r)  # the sink holds its own copy of the read end now
sub_src1 = subprocess.Popen(params_0, stderr=subprocess.PIPE, stdout=w, bufsize=4096)
sub_src1.communicate()
sub_src2 = subprocess.Popen(params_1, stderr=subprocess.PIPE, stdout=w, bufsize=4096)
sub_src2.communicate()
os.close(w)  # close our write end so the sink sees EOF and can finish
sub_sink.wait()
fh_bam.close()

We open the sink (the reader of the pipe) first, and then communicate() with each source process in turn, to avoid the potential blocking mentioned by @Ariel. This also forces the first source process to complete and flush its output over the pipe before the second source process gets a chance to write, preventing interleaved/clobbered output. Once both sources are done, the parent closes its copy of the write end so the sink sees EOF and can terminate. You can play with the bufsize value to tweak performance.

This is pretty much exactly what the shell command is doing.
