How do I find all sequence lengths in a FASTA dataset without using Biopython?
Let's say we have a FASTA file like this:
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
>header2
ASFECAERVA
>header3
ACTGQSDFWGRWTFSH
and this is my desired output:
header1 30
header2 10
header3 16
Please answer this question without using Biopython. I think re.match('^>') can be used here to distinguish the header lines from the sequence lines (after importing re), but I need some help with the rest to get the output. Only use Python, no Biopython. Thank you!
Here is my current code:
import re
path='/home/try.txt'
count=0
seq=''
with open(path) as d_g:
    for line in d_g:
        if re.match('^>',line):
            count=count+1
            print('head',count)
        else:
            seq=seq+line
You really don't need regular expressions for this.
header = None
length = 0
with open('file.fasta') as fasta:
    for line in fasta:
        # Trim newline
        line = line.rstrip()
        if line.startswith('>'):
            # If we captured one before, print it now
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
        else:
            length += len(line)
# Don't forget the last one (checking the header, so an empty last record still prints)
if header is not None:
    print(header, length)
This uses very basic Python idioms which should be easy to pick up from any introduction to text processing in Python. Very briefly, we remember what we have seen so far, and print what we remember before starting to remember a new record (which happens when we see a header line). A common bug is forgetting to print the last record when we reach the end of the file, which of course isn't followed by a new header line.
If you can guarantee that all sequences are on a single line, this could be simplified radically, but I wanted to present a general solution.
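As a rough sketch of that simplification, assuming every record is exactly one header line followed by exactly one sequence line (no wrapped sequences, no blank lines), the lines can be consumed in pairs. The function name single_line_lengths and the sample data are made up for illustration.

```python
import io

def single_line_lengths(lines):
    """Yield (header, length) pairs, assuming each header line is
    followed by exactly one sequence line."""
    stripped = (line.strip() for line in lines)
    # zip() pulls two items at a time from the same iterator:
    # first the header, then its sequence
    for header, seq in zip(stripped, stripped):
        yield header[1:], len(seq)

# Hypothetical sample data standing in for a real file object
sample = io.StringIO(
    ">header1\nASDTWQEREWQDDSFADFASDASDQWEFQW\n"
    ">header2\nASFECAERVA\n"
    ">header3\nACTGQSDFWGRWTFSH\n"
)
for header, length in single_line_lengths(sample):
    print(header, length)
```

If a sequence is wrapped over several lines, the pairing goes out of step and the results are wrong, which is why the general version above tracks state instead.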
collections.Counter() and a simple loop with state to the rescue.
I augmented your test data to have a multi-line sequence too, just so we can tell things work right.
import io
import collections
# Using a StringIO to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())
counter = collections.Counter()
header = None
for line in data:
    line = line.strip()  # deal with newlines
    if line.startswith(">"):
        header = line[1:]
        continue
    counter[header] += len(line)
print(counter)
This prints out the following (header3 has no sequence lines in the test data, so it never gets an entry):
Counter({'header1': 60, 'header4': 16, 'header2': 10})
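To get from the Counter to the "header length" lines asked for in the question, the entries can be printed in a loop. The sketch below wraps the same loop in a function (the name count_lengths is hypothetical) and adds one assumption that is not part of the code above: headers without any sequence are registered with length 0 instead of being left out.

```python
import collections
import io

def count_lengths(lines):
    """Return a Counter mapping header -> total sequence length."""
    counter = collections.Counter()
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            counter[header] += 0  # assumption: register empty records too
            continue
        counter[header] += len(line)
    return counter

data = io.StringIO(""">header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""")
# A Counter is a dict subclass, so items() follows insertion order
# (Python 3.7+), which here is the order of the headers in the file
for header, length in count_lengths(data).items():
    print(header, length)
```

Note that a Counter merges records that share the same header text, so this only suits files where every header is unique.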
Do not reinvent the wheel. There are many specialized bioinformatics tools for handling FASTA files.
For example, use the infoseq utility from the EMBOSS package:
infoseq -auto -only -name -length -noheading file.fasta
Prints:
header1 30
header2 10
header3 16
Install EMBOSS, for example, using conda:
conda create --name emboss emboss
From the man page:
infoseq -help
   -only               boolean    [N] This is a way of shortening the command
                                  line if you only want a few things to be
                                  displayed. Instead of specifying:
                                  '-nohead -noname -noacc -notype -nopgc
                                  -nodesc'
                                  to get only the length output, you can
                                  specify
                                  '-only -length'
   -[no]heading        boolean    [Y] Display column headings
   -name               boolean    [@(!$(only))] Display 'name' column
   -length             boolean    [@(!$(only))] Display 'length' column
Here is a simple for loop you can use:
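with open('file.fasta') as f:
    for i in f:
        if i[0] == ">":
            # Header line: drop the leading '>' and the trailing newline
            x = i[1:].rstrip()
        else:
            # Sequence line (assumes one line per sequence); strip the newline before counting
            y = len(i.rstrip())
            print(x, y)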