How do I find all sequence lengths in a FASTA dataset without using Biopython?

Let's say we have a FASTA file like this:

>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
>header2
ASFECAERVA
>header3
ACTGQSDFWGRWTFSH

And this is my desired output:

header1 30
header2 10
header3 16

Please answer this question without using Biopython. I think re.match('^>') can be used here to distinguish the header lines from the sequence lines (after importing re), but I need some help with the rest to get the output. Only use Python, no Biopython. Thank you!

Here is my current code:

import re
path='/home/try.txt'
count=0
seq=''
with open(path) as d_g:
    for line in d_g:
        if re.match('^>',line):
            count=count+1
            print('head',count)
        else:
            seq=seq+line

You really don't need regular expressions for this.

header = None
length = 0
with open('file.fasta') as fasta:
    for line in fasta:
        # Trim newline
        line = line.rstrip()
        if line.startswith('>'):
            # If we captured one before, print it now
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
        else:
            length += len(line)
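# Don't forget the last one (checking header, not length,
# so a final record with an empty sequence is still printed)
if header is not None:
    print(header, length)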

This uses very basic Python idioms which should be easy to pick up from any introduction to text processing in Python. Very briefly, we remember what we have seen so far, and print it before starting to remember a new record (which happens when we see a header line). A common bug is forgetting to print the last record when we reach the end of the file, because the last record is not followed by a new header line.

If you can guarantee that all sequences are on a single line, this could be simplified radically, but I wanted to present a general solution.
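For comparison, here is a minimal sketch of what the single-line simplification might look like. It is only valid under the assumption stated above (every sequence occupies exactly one line after its header). The function name single_line_lengths and the io.StringIO stand-in for a real file are illustrative choices, not part of the original answer.

```python
import io

def single_line_lengths(handle):
    """Yield (header, length) pairs, assuming one sequence line per header."""
    header = None
    for line in handle:
        line = line.rstrip()
        if line.startswith('>'):
            header = line[1:]
        elif header is not None:
            # The first line after a header is taken as the whole sequence
            yield header, len(line)
            header = None

# io.StringIO stands in for an open file (assumption for the demo)
sample = io.StringIO(
    ">header1\nASDTWQEREWQDDSFADFASDASDQWEFQW\n"
    ">header2\nASFECAERVA\n"
    ">header3\nACTGQSDFWGRWTFSH\n"
)
for name, length in single_line_lengths(sample):
    print(name, length)
```

With no running total to carry across lines, there is also no "last record" to flush at the end of the file. The price is that a wrapped (multi-line) sequence would be silently truncated to its first line, which is why the general solution above is the safer default.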

collections.Counter() and a simple loop with state to the rescue.

I augmented your test data to have a multi-line sequence too, just so we can tell things work right.

import io
import collections

# Using a stringio to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())


counter = collections.Counter()

header = None
for line in data:
    line = line.strip()  # deal with newlines
    if line.startswith(">"):
        header = line[1:]
        continue
    counter[header] += len(line)

print(counter)

This prints out:

Counter({'header1': 60, 'header4': 16, 'header2': 10})
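Two things stand out in that output: header3 is missing because no sequence line ever follows it, and the Counter repr is ordered by count rather than by position in the file. If the "name length" layout from the question is wanted, a sketch along the following lines might help. The function name fasta_lengths and the shortened test data are assumptions made for illustration. It relies on dictionaries (and therefore Counter) preserving insertion order, which holds on Python 3.7 and later.

```python
import collections
import io

def fasta_lengths(handle):
    """Return a Counter mapping each header to its total sequence length."""
    counter = collections.Counter()
    header = None
    for line in handle:
        line = line.strip()  # deal with newlines
        if line.startswith(">"):
            header = line[1:]
            counter[header] += 0  # register the header even if no sequence follows
        elif header is not None:
            counter[header] += len(line)
    return counter

# Shortened stand-in data, including a multi-line record and an empty record
data = io.StringIO(
    ">header1\nASDTWQ\nEREWQD\n>header2\nASFECAERVA\n>header3\n>header4\nACTG\n"
)
for name, length in fasta_lengths(data).items():
    print(name, length)
```

Iterating over .items() walks the headers in the order they appeared in the file, and the `+= 0` line means an empty record such as header3 is reported with length 0 instead of being dropped.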

Do not reinvent the wheel. There are many specialized bioinformatics tools to handle FASTA files.

For example, use the infoseq utility from the EMBOSS package:

infoseq -auto -only -name -length -noheading file.fasta

Prints:

header1        30     
header2        10     
header3        16     

Install EMBOSS, for example, using conda:

conda create --name emboss emboss

From the man page:从手册页:

infoseq -help
   -only               boolean    [N] This is a way of shortening the command
                                  line if you only want a few things to be
                                  displayed. Instead of specifying:
                                  '-nohead -noname -noacc -notype -nopgc
                                  -nodesc'
                                  to get only the length output, you can
                                  specify
                                  '-only -length'
   -[no]heading        boolean    [Y] Display column headings
   -name               boolean    [@(!$(only))] Display 'name' column
   -length             boolean    [@(!$(only))] Display 'length' column

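Here is a simple for loop you can use (it assumes each sequence sits on a single line):

with open('file.fasta') as f:
    for i in f:
        if i[0] == ">":
            # Header line: drop the '>' marker and the newline
            x = i.replace(">", "").rstrip()
        if i[0] != ">":
            # Sequence line: strip the newline so it is not counted
            y = len(i.rstrip())
            print(x, y)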

