How can I find the number of the first base of a gene in a FASTA file?
In order to manually modify a .gff file I have, I need to find the start position of my gene in the FASTA-formatted genome of my animal (i.e., at which base number in the sequence does it start?). I have the sequence of this gene.
How do I do this as easily as possible (this is not an animal whose genome is readily available on the internet)?
What I have: the genome, in FASTA format; a GFF file containing an annotation for this organism's genome (which sorely needs to be updated); the sequence of this gene.
Thank you!
If you know that the gene sequence is identical to that in the reference, do the following (using Python):
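import re
# both sequences are plain strings and must use the same letter case
match = re.search(your_gene_seq, your_genome_seq)
if match:
    # match.start() is a 0-based offset; add 1 for the 1-based GFF start
    gene_start = match.start()
else:
    # the gene may lie on the reverse strand, or differ from the reference
    print("no match")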
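The exact-match approach above can be extended to cover multi-record FASTA files and the reverse strand. This is a minimal sketch, not a definitive implementation: the function names `parse_fasta`, `reverse_complement`, and `locate_gene` are made up for illustration, and it assumes the sequences contain only A, C, G, T, and N. Coordinates are returned 1-based and inclusive, which is what GFF expects.

```python
def parse_fasta(lines):
    """Return {record_name: sequence} from an iterable of FASTA lines."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def reverse_complement(seq):
    """Reverse-complement a DNA string (assumes only A, C, G, T, N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def locate_gene(genome, gene):
    """Return (record, start, end, strand), 1-based inclusive, or None."""
    gene = gene.upper()
    for strand, query in (("+", gene), ("-", reverse_complement(gene))):
        for name, seq in genome.items():
            pos = seq.find(query)  # 0-based offset, -1 if absent
            if pos != -1:
                return name, pos + 1, pos + len(query), strand
    return None
```

Usage would look like `with open("genome.fasta") as fh: genome = parse_fasta(fh)` followed by `locate_gene(genome, your_gene_seq)`, where the file name is a placeholder. Only the first hit is reported, so a gene with several identical copies would need a loop over all occurrences.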
Otherwise, you will need to do a pairwise alignment of your gene to the reference using Biopython:
python -m pip install biopython
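from Bio import pairwise2
# alignment scores: match = 5, mismatch = -4, gap open = -2, gap extend = -0.5
# use a local alignment: a global alignment always reports start = 0
alignment = pairwise2.align.localms(your_gene_seq, your_genome_seq, 5, -4, -2, -0.5)[0]
# alignment[3] indexes the gapped alignment; subtracting the gaps in the genome row
# gives a 0-based genome offset (add 1 for the 1-based GFF start)
gene_start = alignment[3] - alignment[1][:alignment[3]].count("-")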
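If Biopython is not available, the idea behind fitting a gene into a longer sequence can be sketched in plain Python. The function below is a hypothetical illustration (the name `find_approximate_start` is not from any library): it uses unit edit costs rather than the 5/-4/-2/-0.5 scoring above, and it takes time proportional to gene length times genome length, so it is only practical on a candidate region or a small contig, not on a whole animal genome.

```python
def find_approximate_start(gene, genome):
    """Best approximate occurrence of gene in genome.

    Returns (start, end, edits): a 0-based, end-exclusive genome span and
    the number of substitutions/insertions/deletions needed.
    """
    n = len(genome)
    # Row 0: an empty gene prefix fits anywhere at zero cost,
    # and a match beginning at column j starts at genome offset j.
    prev_cost = [0] * (n + 1)
    prev_start = list(range(n + 1))
    for i in range(1, len(gene) + 1):
        cur_cost = [i] + [0] * n   # gene[:i] against nothing costs i
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = prev_cost[j - 1] + (gene[i - 1] != genome[j - 1])
            dele = prev_cost[j] + 1      # gene base absent from the genome
            ins = cur_cost[j - 1] + 1    # extra base present in the genome
            best = min(sub, dele, ins)
            cur_cost[j] = best
            # Carry along the start offset of whichever path was chosen.
            if best == sub:
                cur_start[j] = prev_start[j - 1]
            elif best == dele:
                cur_start[j] = prev_start[j]
            else:
                cur_start[j] = cur_start[j - 1]
        prev_cost, prev_start = cur_cost, cur_start
    # The gene may end at any genome column; take the cheapest one.
    end = min(range(n + 1), key=prev_cost.__getitem__)
    return prev_start[end], end, prev_cost[end]
```

For a GFF entry, the 1-based inclusive coordinates would be `start + 1` and `end`. Carrying the start offset through the table avoids storing the full matrix for a traceback, at the price of not reporting the alignment itself.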
To update the GFF, use Biopython: https://biopython.org/wiki/GFF_Parsing
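For a one-off manual correction, the coordinates can also be patched without a parser, since a GFF3 feature line is nine tab-separated columns with the 1-based, inclusive start and end in columns 4 and 5. The sketch below is only an illustration: `update_gff_feature` is a made-up helper, and it assumes the gene's line carries an `ID=` attribute in column 9.

```python
def update_gff_feature(gff_lines, feature_id, new_start, new_end):
    """Yield GFF3 lines, replacing start/end of the feature with this ID."""
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass through comments, directives, and malformed lines untouched.
        if line.startswith("#") or len(cols) != 9:
            yield line
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        if attrs.get("ID") == feature_id:
            cols[3], cols[4] = str(new_start), str(new_end)
        yield "\t".join(cols)
```

Note that this changes only the line whose ID matches. Child features such as mRNA, exon, or CDS lines that refer to the gene through `Parent=` keep their old coordinates and would need to be updated separately.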