
Reading gzipped text file line-by-line for processing in python 3.2.6

I'm a complete newbie when it comes to Python, but I've been tasked with getting a piece of code running on a machine that has a different version of Python (3.2.6) than the one the code was originally written for.

I've come across an issue with reading a gzipped text file line by line (and processing each line depending on its first character). The code (which was evidently written for Python > 3.2.6) is:

for line in gzip.open(input[0], 'rt'):
    if line[:1] != '>':
        out.write(line)
        continue

    chromname = match2chrom(line[1:-1])
    seqname = line[1:].split()[0]

    print('>{}'.format(chromname), file=out)
    print('{}\t{}'.format(seqname, chromname), file=mappingout)

(For those who know, this splits gzipped FASTA genome files into headers (lines starting with ">") and sequences, and writes the lines to two different files depending on which they are.)
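To make the intended behaviour concrete, here is a minimal sketch of the same loop run on in-memory data instead of a gzip file. The match2chrom below is only a hypothetical stand-in (the real function is not shown in this post), and split_fasta is a name made up for the illustration.

```python
import io


def match2chrom(header):
    # Hypothetical stand-in: the real match2chrom is not shown in the post.
    # Here it just builds 'chr' + the last token of the header text.
    return 'chr' + header.split()[-1]


def split_fasta(lines, out, mappingout):
    """Copy sequence lines to `out` and rewrite header lines."""
    for line in lines:
        if line[:1] != '>':
            out.write(line)  # sequence line: copied through unchanged
            continue

        chromname = match2chrom(line[1:-1])  # header without '>' and newline
        seqname = line[1:].split()[0]        # first token after '>'

        print('>{}'.format(chromname), file=out)
        print('{}\t{}'.format(seqname, chromname), file=mappingout)


out, mappingout = io.StringIO(), io.StringIO()
split_fasta(['>seq1 chromosome 1\n', 'ACGT\n'], out, mappingout)
```

With this stand-in, the header line is rewritten in out, the sequence line is copied as-is, and mappingout receives one tab-separated pair of original and new names.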

I have found https://bugs.python.org/issue13989, which states that mode 'rt' cannot be used with gzip.open in Python 3.2 and suggests using something along the lines of:

import io

with io.TextIOWrapper(gzip.open(input[0], "r")) as fin:
     for line in fin:
         if line[:1] != '>':
             out.write(line)
             continue

         chromname = match2chrom(line[1:-1])
         seqname = line[1:].split()[0]

         print('>{}'.format(chromname), file=out)
         print('{}\t{}'.format(seqname, chromname), file=mappingout)

but the above code does not work:

UnsupportedOperation in line <4> of /path/to/python_file.py:
read1

How can I rewrite this routine to do exactly what I want - read the gzip file line by line into the variable "line" and process each line based on its first character?

EDIT: the traceback from the first version of this routine (Python 3.2.6) is:

Mode rt not supported  
File "/path/to/python_file.py", line 79, in __process_genome_sequences  
File "/opt/python-3.2.6/lib/python3.2/gzip.py", line 46, in open  
File "/opt/python-3.2.6/lib/python3.2/gzip.py", line 157, in __init__

The traceback from the second version is:

UnsupportedOperation in line 81 of /path/to/python_file.py:
read1
File "/path/to/python_file.py", line 81, in __process_genome_sequences

with no further traceback (the extra two lines in the line count are the import io and with io.TextIOWrapper(gzip.open(input[0], "r")) as fin: lines).

I actually appear to have solved the problem.

In the end I had to use shell("gunzip {input[0]}") to decompress the file first, so that it could be read in text mode, and then read the resulting file using:

for line in open(' *< resulting file >* ','r'):
    if line[:1] != '>':
        out.write(line)
        continue

    chromname = match2chrom(line[1:-1])
    seqname = line[1:].split()[0]

    print('>{}'.format(chromname), file=out)
    print('{}\t{}'.format(seqname, chromname), file=mappingout)  
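An alternative that avoids unpacking the file on disk, offered only as a sketch: text mode ('rt') for gzip.open was added in Python 3.3, but binary mode is available in 3.2, so each line can be read as bytes and decoded by hand. The helper name iter_text_lines and the 'ascii' default encoding are assumptions made for illustration, and I have not been able to try this on 3.2.6 itself.

```python
import gzip


def iter_text_lines(path, encoding='ascii'):
    """Yield decoded text lines from a gzip-compressed file.

    The file is opened in binary mode ('rb') and every line is decoded
    explicitly, so the 'rt' mode of gzip.open is never needed.
    """
    fin = gzip.open(path, 'rb')
    try:
        for raw in fin:  # iterating a GzipFile yields bytes lines
            yield raw.decode(encoding)
    finally:
        fin.close()
```

The processing loop would then start with for line in iter_text_lines(input[0]): and keep the same body as above. Note that, unlike text mode, this does not translate line endings, so a file with Windows-style "\r\n" endings would keep the "\r" in each line.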
