Python: How to extract DNA sequences based on a text file with binary content?

Question

For example, I have a FASTA file with the following sequences:

>human1
AGGGCGSTGC
>human2
GCTTGCGCTAG
>human3
TTCGCTAG

How can I use Python to read a text file with the following content and use it to extract the sequences? 1 represents true and 0 represents false. Only sequences with the value 1 should be extracted.

Example text file:

0
1
1

Expected output:

>human2
GCTTGCGCTAG
>human3
TTCGCTAG

Answer 1

For this it is better to use Biopython:

from Bio import SeqIO

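with open("mask.txt") as mask_file:
    # True for lines containing "1", False otherwise
    mask = [line.strip() == "1" for line in mask_file]
seqs = list(SeqIO.parse("input.fasta", "fasta"))
# keep only the records whose mask entry is True
seqs_filter = [seq for flag, seq in zip(mask, seqs) if flag]
for seq in seqs_filter:
    print(seq.format("fasta"), end="")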

You get:

>human2
GCTTGCGCTAG
>human3
TTCGCTAG

Explanation

Parse FASTA: in the FASTA format a sequence may span several lines (check the FASTA format), so it is better to use a specialized library to read (parse) the input and write the output.

Mask: I read the mask file and convert each line to a boolean, giving [False, True, True].

Filter: use the zip function to pair each sequence with its mask entry, then use a list comprehension to keep only the sequences whose flag is true.
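If Biopython is not available, the same three steps (parse, mask, filter) can be sketched with the standard library only. This is a minimal illustration, not a replacement for a real parser: the helper names read_fasta, read_mask and filter_records are made up for this example, and the reader only handles plain ">" headers followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_mask(lines):
    """Turn lines of '0'/'1' into booleans, ignoring blank lines."""
    return [line.strip() == "1" for line in lines if line.strip()]


def filter_records(fasta_lines, mask_lines):
    """Pair each record with its mask entry and keep the flagged ones."""
    mask = read_mask(mask_lines)
    return [rec for flag, rec in zip(mask, read_fasta(fasta_lines)) if flag]


# Sample data taken from the question
fasta = """>human1
AGGGCGSTGC
>human2
GCTTGCGCTAG
>human3
TTCGCTAG
""".splitlines()
mask = ["0", "1", "1"]

for name, seq in filter_records(fasta, mask):
    print(">" + name)
    print(seq)
```

Both helpers accept any iterable of lines, so an open file object can be passed in place of the sample lists.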

Answer 2

I think this may help you, and I really think you should take some time to learn Python. Python is a good language for bioinformatics.

# Read the mask: one 0 or 1 per line
display = []
with open('test.txt') as f:
    for line in f:
        display.append(int(line.strip()))

output_DNA = []
with open('XX.fasta') as f:
    index = -1
    for line in f:
        if line[0] == '>':
            # a header line starts the next record
            index = index + 1

        if display[index]:
            output_DNA.append(line)

# Print the kept lines as FASTA text rather than as a list
print(''.join(output_DNA), end='')

Answer 3

You can create a list to act as a mask for when you read your FASTA file:

with open('mask.txt') as mf:
    mask = [ s.strip() == '1' for s in mf.readlines() ]

Then:

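with open('seq.fasta') as f:
    for i, line in enumerate(f):
        # assumes every record is exactly two lines (header, sequence),
        # so file line i belongs to record i // 2
        if mask[i // 2]:
            print(line, end='')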

Or:

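# mask_file and seq_file hold the two file paths; this assumes every
# record is exactly two lines (header, then sequence)
with open(mask_file) as mf, open(seq_file) as sf:
    for b in mf:
        record = sf.readline() + sf.readline()
        if b.strip() == '1':
            print(record, end='')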
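The mask idea above can also be expressed with itertools.compress from the standard library, which keeps the items whose selector is true. A small sketch, assuming the records have already been grouped into (header, sequence) pairs; the records and mask lists below are just the sample data from the question written out by hand:

```python
from itertools import compress

# Sample records from the question, already grouped per sequence
records = [
    (">human1", "AGGGCGSTGC"),
    (">human2", "GCTTGCGCTAG"),
    (">human3", "TTCGCTAG"),
]
mask = [False, True, True]

# compress() yields each record whose matching mask entry is true
selected = list(compress(records, mask))
for header, sequence in selected:
    print(header)
    print(sequence)
```

Grouping by record first avoids the mismatch between mask lines and file lines, since the mask has one entry per sequence while the FASTA file has at least two lines per sequence.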

Answer 4

I am not familiar with the FASTA file format specifically, but hopefully this helps. You can open your file in Python the following way and collect the valid line entries in a list.

valid = []
with open('test.txt') as f:
    all_lines = f.readlines() # get all the lines
    all_lines = [x.strip() for x in all_lines] # strip away newline chars
    for i in range(len(all_lines)):
        if all_lines[i] == '1': # if it matches our condition
            valid.append(i) # add the index to our list

    print(valid) # or get only the fasta file contents on these lines

I ran it with the following text file test.txt:

And got this output when printing valid:

[1, 2, 3, 6, 7]

I think this will help you move along, but please let me know in the comments if you need an expanded answer.
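To go from the index list to the actual sequences, one possible follow-up is sketched below. It assumes the indices in valid are 0-based positions of records (not of file lines) and that every record starts with a ">" header line; select_by_index is a hypothetical helper name, not part of the answer above.

```python
def select_by_index(fasta_lines, valid):
    """Return the lines of the records whose 0-based position is in valid."""
    wanted = set(valid)
    kept, index = [], -1
    for line in fasta_lines:
        if line.startswith(">"):
            index += 1  # a header line starts the next record
        if index in wanted:
            kept.append(line.rstrip("\n"))
    return kept


# Sample data from the question; a mask of 0, 1, 1 gives valid == [1, 2]
fasta_lines = [
    ">human1\n", "AGGGCGSTGC\n",
    ">human2\n", "GCTTGCGCTAG\n",
    ">human3\n", "TTCGCTAG\n",
]
valid = [1, 2]
print("\n".join(select_by_index(fasta_lines, valid)))
```

Counting headers rather than lines keeps this working when a sequence is wrapped over several lines.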


4 answers

Answer 1
5 (accepted) 2015-05-20 16:17:57

Answer 2
3 2015-05-20 16:00:27

Answer 3
1 2015-05-20 15:43:13

Answer 4
0 2015-05-20 15:29:18
