How to find all sequence lengths in a FASTA dataset without using Biopython

Question

Let's say we have a FASTA file like this:

>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
>header2
ASFECAERVA
>header3
ACTGQSDFWGRWTFSH

and this is my desired output:

header1 30
header2 10
header3 16

Please answer this question without using Biopython. I think re.match('^>') can be used here to distinguish header lines from sequence lines (re needs to be imported first), but I need some help with the rest to get the output. Only use Python, no Biopython. Thank you!

Here is my current code:

import re
path='/home/try.txt'
count=0
seq=''
with open(path) as d_g:
    for line in d_g:
        if re.match('^>',line):
            count=count+1
            print('head',count)
        else:
            seq=seq+line

Answer 1

You really don't need regular expressions for this.

header = None
length = 0
with open('file.fasta') as fasta:
    for line in fasta:
        # Trim newline
        line = line.rstrip()
        if line.startswith('>'):
            # If we captured one before, print it now
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
        else:
            length += len(line)
# Don't forget the last one
if header is not None:
    print(header, length)

This uses very basic Python idioms which should be easy to pick up from any introduction to text processing in Python. Very briefly, we remember what we have seen so far, and print what we remember before starting to remember a new record (which happens when we see a header line). A common bug is forgetting to print the last record when we reach the end of the file, which of course doesn't have a new header line.

If you can guarantee that all sequences are on a single line, this could be simplified radically, but I wanted to present a general solution.
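For illustration, here is a minimal sketch of what that simplification might look like. It assumes every header line is followed by exactly one sequence line; the function name single_line_lengths and the in-memory test data are made up for this example and are not part of the original answer.

```python
import io

def single_line_lengths(handle):
    """Yield (header, length) pairs, assuming one sequence line per header."""
    for line in handle:
        if line.startswith('>'):
            header = line[1:].rstrip()
            # Assumption: the very next line holds the entire sequence
            sequence = next(handle, '').rstrip()
            yield header, len(sequence)

# Emulate the file from the question with an in-memory buffer
fasta = io.StringIO(
    ">header1\nASDTWQEREWQDDSFADFASDASDQWEFQW\n"
    ">header2\nASFECAERVA\n"
    ">header3\nACTGQSDFWGRWTFSH\n"
)
results = list(single_line_lengths(fasta))
for header, length in results:
    print(header, length)
```

If a sequence is wrapped over several lines, or a header has no sequence at all, this sketch gives wrong results, which is why the general loop above is the safer default.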

Answer 2

collections.Counter() and a simple loop with state to the rescue.

I augmented your test data to have a multi-line sequence too, just so we can tell things work right.

import io
import collections

# Using a stringio to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())


counter = collections.Counter()

header = None
for line in data:
    line = line.strip()  # deal with newlines
    if line.startswith(">"):
        header = line[1:]
        continue
    counter[header] += len(line)

print(counter)

This prints out:

Counter({'header1': 60, 'header4': 16, 'header2': 10})
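Note that header3, which has no sequence lines, is missing from this output: a Counter key is only created when something is added to it, and the Counter is printed in order of descending count rather than file order. If empty records and the original order matter, one possible variation (a sketch, not part of the original answer) is to use a plain dict and register each header with a length of 0 as soon as it is seen. The variable name lengths is chosen here for illustration.

```python
import io

# Same test data as above, including the empty header3 record
data = io.StringIO(""">header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""")

lengths = {}  # plain dicts preserve insertion order in Python 3.7+
header = None
for line in data:
    line = line.strip()
    if line.startswith(">"):
        header = line[1:]
        lengths[header] = 0  # register the header even if no sequence follows
    elif header is not None:
        lengths[header] += len(line)

for name, length in lengths.items():
    print(name, length)
```

This prints one "name length" pair per line in file order, with header3 reported as 0.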

Answer 3

Do not reinvent the wheel. There are many specialized bioinformatics tools to handle FASTA files.

For example, use the infoseq utility from the EMBOSS package:

infoseq -auto -only -name -length -noheading file.fasta

Prints:

header1        30     
header2        10     
header3        16

Install EMBOSS, for example, using conda:

conda create --name emboss emboss

From the man page:

infoseq -help
   -only               boolean    [N] This is a way of shortening the command
                                  line if you only want a few things to be
                                  displayed. Instead of specifying:
                                  '-nohead -noname -noacc -notype -nopgc
                                  -nodesc'
                                  to get only the length output, you can
                                  specify
                                  '-only -length'
   -[no]heading        boolean    [Y] Display column headings
   -name               boolean    [@(!$(only))] Display 'name' column
   -length             boolean    [@(!$(only))] Display 'length' column

Answer 4

Here is a simple for loop you can use:

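# Assumes each sequence is on a single line;
# 'file.fasta' is a placeholder file name
with open('file.fasta') as f:
    for i in f:
        if i[0] == ">":
            x = i.replace(">", "").rstrip()
        else:
            y = len(i.rstrip())  # exclude the trailing newline
            print(x, y)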

Answer 1
3 2021-10-02 09:47:27

Answer 2
1 2021-10-02 09:48:54

Answer 3
0 2021-10-04 21:28:57

Answer 4
0 2022-09-17 14:58:33
