
Python equivalent of piping zcat result to filehandle in Perl

I have a huge pipeline written in Python that works with very large .gz files (~14 GB compressed), but I need a better way to send certain lines to an external program (formatdb from blast-legacy/2.2.26). Someone wrote me a Perl script a long time ago that does this extremely fast, but I need to do the same thing in Python, since the rest of the pipeline is written in Python and I have to keep it that way. The Perl script uses two file handles: one holds the output of zcat on the .gz file, and the other is a pipe into the software, receiving the lines it needs (2 of every 4). It involves bioinformatics, but no domain experience is needed. The file is in fastq format and the software needs fasta format. Every 4 lines is one fastq record: take the 1st and 2nd lines, replace the leading '@' of the 1st line with '>', and that is the fasta equivalent formatdb will use for each record.
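The record rule above can be sketched as a small generator; this is a hedged illustration (the function name is mine, not from the pipeline):

```python
def fastq_to_fasta(lines):
    """Yield fasta lines for an iterable of fastq lines.

    Every 4 lines is one fastq record; the fasta equivalent is the
    header (leading '@' replaced by '>') plus the sequence line.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:        # header line, e.g. '@read1'
            yield '>' + line[1:]
        elif i % 4 == 1:      # sequence line
            yield line
```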

The Perl script is as follows:

#!/usr/bin/perl 
my $SRG = $ARGV[0]; # reads.fastq.gz

open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";

# -i: input -n: db name -p: program 
open ($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";

#Fastq => Fasta sub
my $localcounter = 0;
while (my $line = <$fh>){
        if ($. % 4==1){
                print $fh2 "\>" . substr($line, 1);
                $localcounter++;
        }
        elsif ($localcounter == 1){
                print $fh2 "$line";
                $localcounter = 0;
        }
        else{
        }
}
close $fh;
close $fh2;
exit;

It works really well. How could I do this same thing in Python? I like how Perl can use those file handles, but I'm not sure how to do that in Python without creating an actual file. All I can think of is to gzip.open the file, write the two lines of every record I need to a new file, and use that with formatdb, but it is way too slow. Any ideas? I need to work it into the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this in general. I assume I need to use some form of the subprocess module.

Here is my Python code, but again it is way too slow, and speed is the issue here (the files are huge):

#!/usr/bin/env python

import gzip
from Bio import SeqIO # can recognize fasta/fastq records
import subprocess as sp
import os,sys

filename = sys.argv[1] # reads.fastq.gz

tempFile = filename + ".temp.fasta"

outFile = open(tempFile, "w")

handle = gzip.open(filename, "rt")  # text mode, so SeqIO gets str lines
# parses each fastq record
# r.id and r.seq are the 1st and 3rd lines of each record
for r in SeqIO.parse(handle, "fastq"):
    outFile.write(">" + str(r.id) + "\n")
    outFile.write(str(r.seq) + "\n")

handle.close()
outFile.close()

cmd = 'formatdb -i ' + tempFile + ' -n ' + filename + ' -p F'
sp.call(cmd, shell=True)

os.remove(tempFile)

First, there's a much better solution in both Perl and Python: just use a gzip library. In Python, there's one in the stdlib; in Perl, you can find one on CPAN. For example:

with gzip.open(path, 'rt', encoding='utf-8') as f:
    for line in f:
        do_stuff(line)

Much simpler, more efficient, and more portable than shelling out to zcat.
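Applied to the fastq case above, here is a hedged sketch that streams the converted records straight into the external program's stdin with no temp file. It assumes formatdb accepts fasta on stdin via `-i stdin`, exactly as the Perl script does; the `cmd` parameter is my addition so any line-consuming command can stand in:

```python
import gzip
import subprocess

def gz_fastq_to_db(path, cmd=None):
    """Stream fasta records from a gzipped fastq file into cmd's stdin."""
    if cmd is None:
        # Assumed invocation, copied from the Perl script.
        cmd = ['formatdb', '-i', 'stdin', '-n', path, '-p', 'F']
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    with gzip.open(path, 'rb') as f:            # bytes in, bytes out
        for i, line in enumerate(f):
            if i % 4 == 0:                      # fastq header
                proc.stdin.write(b'>' + line[1:])
            elif i % 4 == 1:                    # sequence
                proc.stdin.write(line)
    out, _ = proc.communicate()                 # closes stdin, waits
    return out
```

For large outputs from the child process you would want to interleave reads and writes (or drop `stdout=subprocess.PIPE`) to avoid pipe-buffer deadlock; formatdb writes to files, so in practice its stdout stays small.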


But if you really do want to launch a subprocess and control its pipes in Python, you can do it with the subprocess module. And, unlike Perl, Python can do this without having to stick a shell in the middle. There's even a nice section in the docs, Replacing Older Functions with the subprocess Module, that gives you recipes.

So:

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)

Now, zcat.stdout is a file-like object, with the usual read methods and so on, wrapping the pipe from the zcat subprocess.

So, for example, to read a binary file 8K at a time in Python 3.x:

import functools
import subprocess

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
for chunk in iter(functools.partial(zcat.stdout.read, 8192), b''):
    do_stuff(chunk)
zcat.wait()

(If you want to do this in Python 2.x, or read a text file one line at a time instead of a binary file 8K at a time, or whatever, the changes are the same as they'd be for any other file-handling code.)
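To read the pipe one decoded line at a time, as the Perl loop does, one sketch (assuming Python 3 and a zcat on PATH) wraps the byte pipe in a text stream:

```python
import io
import subprocess

def lines_from_zcat(path):
    """Yield decoded lines from `zcat path`, one at a time."""
    zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
    try:
        # TextIOWrapper decodes the byte pipe and buffers line splits.
        for line in io.TextIOWrapper(zcat.stdout, encoding='utf-8'):
            yield line
    finally:
        zcat.stdout.close()
        zcat.wait()
```

From here, applying the every-4-lines rule and writing into a second Popen's stdin reproduces the Perl script's two-handle structure.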

You can parse the whole file and load it as a list of lines using this function:

def convert_gz_to_list_of_lines(filepath):
    """Parse a gz file and convert it into a list of lines."""
    file_as_list = []
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        try:
            for line in f:
                file_as_list.append(line)
        except EOFError:
            pass  # truncated gz file; keep the lines read so far
    return file_as_list


 