
Python equivalent of piping zcat result to filehandle in Perl

I have a huge pipeline written in Python that works on very large .gz files (~14 GB compressed), but I need a better way to send certain lines to an external program (formatdb from blast-legacy/2.2.26). I have a Perl script someone wrote for me a long time ago that does this extremely fast, but I need to do the same thing in Python, since the rest of the pipeline is written in Python and I have to keep it that way. The Perl script uses two file handles: one holds the zcat of the .gz file, and the other collects the lines the software needs (2 of every 4) and feeds them to it as input. It involves bioinformatics, but no domain experience is needed. The file is in fastq format and the software needs it in fasta format: every 4 lines is a fastq record; take the 1st and 2nd lines, add '>' to the beginning of the 1st, and that is the fasta equivalent that formatdb will use for each record.

The Perl script is as follows:

#!/usr/bin/perl 
my $SRG = $ARGV[0]; # reads.fastq.gz

open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";

# -i: input -n: db name -p: program 
open ($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";

#Fastq => Fasta sub
my $localcounter = 0;
while (my $line = <$fh>){
        if ($. % 4==1){
                print $fh2 "\>" . substr($line, 1);
                $localcounter++;
        }
        elsif ($localcounter == 1){
                print $fh2 "$line";
                $localcounter = 0;
        }
        else{
        }
}
close $fh;
close $fh2;
exit;

It works really well. How can I do the same thing in Python? I like how Perl can use those file handles, but I'm not sure how to do that in Python without creating an actual file. All I can think of is to gzip.open the file and write the two lines of every record I need to a new file, then use that with formatdb, but it is way too slow. Any ideas? I need to work it into the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this in general. I assume I need some form of the subprocess module.

Here is my Python code, but again it is way too slow, and speed is the issue here (HUGE files):

#!/usr/bin/env python

import gzip
from Bio import SeqIO # can recognize fasta/fastq records
import subprocess as sp
import os,sys

filename = sys.argv[1] # reads.fastq.gz

tempFile = filename + ".temp.fasta"

outFile = open(tempFile, "w")

handle = gzip.open(filename, "r")
# parses each fastq record
# r.id and r.seq come from the 1st and 2nd lines of each record
for r in SeqIO.parse(handle, "fastq"):
    outFile.write(">" + str(r.id) + "\n")
    outFile.write(str(r.seq) + "\n")

handle.close()
outFile.close()

cmd = 'formatdb -i ' + tempFile + ' -n ' + filename + ' -p F'
sp.call(cmd, shell=True)

cmd = 'rm ' + tempFile
sp.call(cmd, shell=True)

First, there's a much better solution in both Perl and Python: just use a gzip library. In Python, there's one in the stdlib; in Perl, you can find one on CPAN. For example:

with gzip.open(path, 'rt', encoding='utf-8') as f:
    for line in f:
        do_stuff(line)

Much simpler, more efficient, and more portable than shelling out to zcat.
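
Applied to the conversion in the question, that is already enough for the fastq-to-fasta step. A minimal sketch that streams the fasta records to stdout (the reads.fastq.gz argument and the two-of-every-four-lines rule come from the question; the rest is plain stdlib):

import gzip
import sys

path = sys.argv[1]  # reads.fastq.gz

with gzip.open(path, 'rt') as f:
    for i, line in enumerate(f):
        if i % 4 == 0:    # fastq header line: swap the leading '@' for '>'
            sys.stdout.write('>' + line[1:])
        elif i % 4 == 1:  # sequence line, passed through unchanged
            sys.stdout.write(line)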


But if you really do want to launch a subprocess and control its pipes in Python, you can do it with the subprocess module. And, unlike Perl, Python can do this without having to stick a shell in the middle. There's even a nice section in the docs, Replacing Older Functions with the subprocess Module, that gives you recipes.

So:

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)

Now, zcat.stdout is a file-like object, with the usual read methods and so on, wrapping the pipe to the zcat subprocess.

So, for example, to read a binary file 8K at a time in Python 3.x:

import functools
import subprocess

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
# iter(callable, sentinel) keeps calling read(8192) until it returns b'' at EOF
for chunk in iter(functools.partial(zcat.stdout.read, 8192), b''):
    do_stuff(chunk)
zcat.wait()

(If you want to do this in Python 2.x, or read a text file one line at a time instead of a binary file 8K at a time, or whatever, the changes are the same as they'd be for any other file-handling code.)
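
Putting both pipes together gives the direct counterpart of the Perl script: one Popen reading from zcat, another feeding formatdb's stdin. A minimal sketch reusing the formatdb flags from the Perl script (error handling kept minimal; in Python 3.x the pipes carry bytes):

import subprocess
import sys

path = sys.argv[1]  # reads.fastq.gz

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
# same flags as the Perl script: -i stdin, -n <db name>, -p F
formatdb = subprocess.Popen(['formatdb', '-i', 'stdin', '-n', path, '-p', 'F'],
                            stdin=subprocess.PIPE)

for i, line in enumerate(zcat.stdout):
    if i % 4 == 0:    # fastq header -> fasta header
        formatdb.stdin.write(b'>' + line[1:])
    elif i % 4 == 1:  # sequence line
        formatdb.stdin.write(line)

formatdb.stdin.close()  # EOF lets formatdb finish and write the db
zcat.wait()
formatdb.wait()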

You can parse the whole file and load it as a list of lines using this function:

import gzip

def convert_gz_to_list_of_lines(filepath):
    """Parse a gz file and convert it into a list of lines."""
    file_as_list = []
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        try:
            for line in f:
                file_as_list.append(line)
        except EOFError:
            # a truncated archive raises EOFError; keep the lines read so far
            pass
    return file_as_list
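
For example (the filename here is just the one from the question):

lines = convert_gz_to_list_of_lines('reads.fastq.gz')
print(len(lines))

Note that this holds every line in memory at once, so for files as large as the ~14 GB ones in the question the streaming approaches above are a better fit.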
