How do I find the position of the first base of a gene in a FASTA file?

Question

In order to manually modify a .gff file I have, I need to find the start position of my gene in the FASTA-formatted genome of my animal (i.e., which base number is it in the sequence?). I have the sequence of this gene.

How do I do this as easily as possible? (This is not an animal whose genome is readily available on the internet.)

What I have: the genome, in FASTA format; a GFF file containing an annotation for this organism's genome (which sorely needs updating); and the sequence of this gene.

Thank you!

Answer 1

If you know that the gene sequence is identical to that in the reference, do the following (using Python):

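import re

# your_gene_seq and your_genome_seq are plain strings in the same case
match = re.search(re.escape(your_gene_seq), your_genome_seq)
if match:
    # match.start() is 0-based; GFF coordinates are 1-based and inclusive
    gene_start = match.start() + 1
    gene_end = match.end()
else:
    # the gene may lie on the reverse strand, or differ from the reference
    print("no match")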
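For a multi-record FASTA file, or a gene that may sit on the reverse strand, the exact-match idea can be extended roughly as follows. This is a minimal sketch that uses only the standard library; the helper names `read_fasta`, `reverse_complement` and `find_gene` are made up for illustration, and it assumes the gene occurs as one uninterrupted, identical stretch of the genome (so it takes a genomic sequence, not a spliced transcript).

```python
def read_fasta(path):
    """Return {record_name: sequence} for a FASTA file, sequences upper-cased."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                # Use the first word of the header as the record name.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_gene(records, gene_seq):
    """Yield (seqid, start, end, strand) for every exact hit.

    Coordinates are 1-based and inclusive, which is what GFF expects.
    """
    gene = gene_seq.upper()
    for seqid, seq in records.items():
        for strand, query in (("+", gene), ("-", reverse_complement(gene))):
            pos = seq.find(query)
            while pos != -1:
                yield seqid, pos + 1, pos + len(query), strand
                pos = seq.find(query, pos + 1)
```

Calling `list(find_gene(read_fasta("genome.fa"), gene_seq))`, where `genome.fa` is a placeholder file name, lists every hit; more than one hit means the sequence is not unique in the genome and the right copy has to be picked by hand.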

Otherwise, you will need to do a pairwise alignment of your gene to the reference, using Biopython:

python -m pip install biopython

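from Bio import pairwise2

# alignment scores: match = 5, mismatch = -4, gap open = -2, gap extend = -0.5
# local (not global) alignment: a global alignment always reports a start of 0
alignment = pairwise2.align.localms(your_gene_seq, your_genome_seq, 5, -4, -2, -0.5)[0]
gapped_gene, gapped_genome, score, begin, end = alignment
# begin indexes the gapped alignment, so count the genome bases that precede it
gene_start = len(gapped_genome[:begin].replace("-", "")) + 1  # 1-based, as in GFF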
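To make clear what the alignment step computes, here is a small pure-Python Smith-Waterman sketch that returns where the best local alignment sits in the target sequence. It is an illustration under stated assumptions, not the Biopython implementation: the function name `local_align_start` is hypothetical, it uses a single linear gap penalty instead of separate gap-open and gap-extend scores, and its time and memory grow with len(gene) × len(region), so it is only practical on a candidate region rather than a whole genome.

```python
def local_align_start(gene, region, match=5, mismatch=-4, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (score, start, end): the best local alignment score and the
    1-based inclusive coordinates of the aligned stretch within `region`.
    Returns (0, None, None) when nothing aligns with a positive score.
    """
    gene, region = gene.upper(), region.upper()
    rows, cols = len(gene) + 1, len(region) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)

    def substitution(i, j):
        return match if gene[i - 1] == region[j - 1] else mismatch

    # Fill the dynamic-programming matrix, remembering the best cell.
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            score[i][j] = cell
            if cell > best:
                best, best_cell = cell, (i, j)

    if best == 0:
        return 0, None, None

    # Trace back from the best cell until the score drops to zero.
    i, j = best_cell
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gene base aligned to a gap in the region
        else:
            j -= 1  # region base aligned to a gap in the gene
    return best, j + 1, end
```

If `region` is a slice of a larger scaffold, add the slice offset to the returned start and end before writing them into the GFF. For a search against an entire genome, a dedicated aligner such as BLAST or minimap2 is likely the more realistic choice.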

To update the GFF, use Biopython's GFF parser: https://biopython.org/wiki/GFF_Parsing
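If the change is limited to one feature's coordinates, editing the tab-separated columns directly may be enough. The sketch below assumes a GFF3 file with the standard nine columns (start in column 4, end in column 5, strand in column 7) and a feature identified by an `ID=` attribute in column 9; the function name `update_gff_coordinates` is hypothetical. It does not touch child features such as mRNA, exon or CDS lines, which would need the same treatment.

```python
def update_gff_coordinates(gff_lines, feature_id, start, end, strand=None):
    """Return GFF3 lines with new start/end (and optionally strand) for one feature.

    `feature_id` is compared with the ID= attribute in column 9.
    Coordinates are 1-based and inclusive, with start <= end.
    """
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    updated = []
    for line in gff_lines:
        newline = "\n" if line.endswith("\n") else ""
        fields = line.rstrip("\n").split("\t")
        # Leave comments, directives and malformed lines untouched.
        if line.startswith("#") or len(fields) != 9:
            updated.append(line)
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if attributes.get("ID") == feature_id:
            fields[3], fields[4] = str(start), str(end)
            if strand is not None:
                fields[6] = strand
            line = "\t".join(fields) + newline
        updated.append(line)
    return updated
```

A typical use would be to read the file with `open(path).readlines()`, pass the lines through the function with the coordinates found earlier, and write the result to a new file so that the original annotation stays intact.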

Answer 1 timestamp: 2019-11-12 09:55:35