How can I speed up a bioinformatics script?

Question

I'm developing a Python script for bioinformatic analysis. First, the script reads an entire .fasta file (essentially one very long string) to find all the scaffolds (lines that start with '>'), then it prints the number of scaffolds found. I have two similar .fasta input files: one over 1.5 GB that runs in less than a minute, and a second one of 85 MB that takes more than 31 hours.

import sys

cabecalho = []
sequencia = []
contador = -1
file_open = open('C:\PYTHON\Chr09.fasta', "r")
for line in file_open:
    line = line.rstrip()
    if ">" in line:
        cabecalho.append(line)
        contador += 1
        sequencia.insert(contador, '')
    else:
        sequencia[contador] += line
con = contador + 1
print(con)

What can I do to optimize the running speed of this script? Or how can I check what's wrong with the file? (Both files have the same format and the same text configuration.)

Answer 1

First of all, you don't need to reinvent the wheel. Biopython can easily handle FASTA files, for example:

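from Bio import SeqIO

myseqs = {}
# SeqIO.parse accepts a file path; the raw string keeps the backslashes literal.
fasta_sequences = SeqIO.parse(r'C:\PYTHON\Chr09.fasta', 'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.description, str(fasta.seq)
    myseqs[name] = sequence

# len() returns an int, so pass it as a separate argument instead of concatenating.
print("total sequences:", len(myseqs))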

This way you also get your sequences as a dict, so you can easily look them up by FASTA header and do whatever you want with them.

Finally, to install Biopython just type:

pip install biopython
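
If installing Biopython is not an option, the same header-to-sequence dict can be sketched in plain Python. This is only a minimal sketch, not part of the original answer: the function name `read_fasta` and the sample records are hypothetical, and it assumes well-formed FASTA input. It collects each record's sequence lines in a list and joins them once per record, rather than growing a string line by line.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []  # sequence lines of the current record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)  # join once per record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)  # flush the last record
    return records


# Tiny in-memory example (hypothetical data); pass an open file object
# instead of a list to read a real FASTA file.
example = [">scaffold_1 demo", "ACGT", "TTGA", ">scaffold_2", "GGCC"]
seqs = read_fasta(example)
print("total sequences:", len(seqs))
```

Because the function takes any iterable of lines, it can be fed an open file handle directly, so the file is streamed rather than loaded in one piece.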

Another approach, without Python: if you only want to know the number of scaffolds, you can do it in one line using the grep command in a Unix environment:

grep -c ">" myfasta.fasta

The -c option prints only the count of matching lines.
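
The path in the question is a Windows path, where grep may not be installed, so the same count can also be sketched in Python without building any sequences. This is an illustrative sketch only: the function name `count_headers` and the sample lines are hypothetical. Unlike the grep pattern above, it counts only lines that start with '>', which matches the definition of a scaffold given in the question.

```python
from typing import Iterable


def count_headers(lines: Iterable[str]) -> int:
    """Count lines that start with '>' (one per FASTA record)."""
    return sum(1 for line in lines if line.startswith(">"))


# Small in-memory example (hypothetical data). For a real file use:
#     with open("myfasta.fasta") as handle:
#         print(count_headers(handle))
example = [">scaffold_1\n", "ACGT\n", ">scaffold_2\n", "GG>CC\n"]
print(count_headers(example))  # prints 2; the '>' inside "GG>CC" is not a header
```

Since nothing is stored per record, memory use stays constant regardless of file size.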

Regards

Solution 1
1 2021-03-08 14:43:51

如何提高生物信息学脚本的速度？

问题描述

1 个解决方案

解决方案1 1 2021-03-08 14:43:51

解决方案1
1 2021-03-08 14:43:51