
How can I find the number of the first base of a gene in a FASTA file?

In order to manually modify a .gff file I have, I need to find the start position of my gene in the FASTA-formatted genome of my animal (i.e., at which base number in the sequence does the gene begin?). I have the sequence of this gene.

How do I do this as easily as possible (this is not an animal whose genome is readily available on the internet)?

What I have: the genome, in FASTA format; a GFF file containing an annotation for this organism's genome (which sorely needs updating); the sequence of this gene.

Thank you!

If you know that the gene sequence is identical to the corresponding region of the reference genome, an exact search is enough (using Python):

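# your_gene_seq and your_genome_seq: plain strings without FASTA headers or
# line breaks, same case; this only finds a gene on the forward strand
pos = your_genome_seq.find(your_gene_seq)  # 0-based index, -1 if absent
if pos != -1:
    gene_start = pos + 1                 # GFF coordinates are 1-based
    gene_end = pos + len(your_gene_seq)  # inclusive end coordinate
else:
    print("no match")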
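A slightly fuller sketch of the exact-match case, in case it helps: it reads the FASTA file without extra dependencies, searches both strands of every record, and reports 1-based inclusive coordinates of the kind that go in GFF columns 4 and 5. The function names (read_fasta, reverse_complement, find_gene) and the file name genome.fa are placeholders of mine, not part of any library, and the sketch assumes the sequences contain only A, C, G, T and N.

```python
def read_fasta(path):
    """Return {record_id: sequence} for a FASTA file, joining wrapped lines."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                parts = line[1:].split()
                # Record ID = first word of the header line
                name, chunks = (parts[0] if parts else ""), []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_gene(records, gene_seq):
    """Yield (seqid, start, end, strand) for every exact hit.

    start and end are 1-based and inclusive, and start <= end on both
    strands, matching the GFF convention.
    """
    gene_seq = gene_seq.upper()
    queries = [("+", gene_seq)]
    rc = reverse_complement(gene_seq)
    if rc != gene_seq:  # avoid reporting palindromic sequences twice
        queries.append(("-", rc))
    for strand, query in queries:
        for seqid, seq in records.items():
            pos = seq.find(query)
            while pos != -1:
                yield seqid, pos + 1, pos + len(query), strand
                pos = seq.find(query, pos + 1)


# Example use (the file name is a placeholder):
# records = read_fasta("genome.fa")
# for hit in find_gene(records, your_gene_seq):
#     print(*hit, sep="\t")
```

If this prints more than one hit, the gene sequence is not unique in the genome and you will have to decide which copy the annotation refers to.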

Otherwise, you will need to do a pairwise alignment of your gene against the reference, for example using Biopython:

python -m pip install biopython

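from Bio import pairwise2  # deprecated in recent Biopython releases in favour of Bio.Align.PairwiseAligner
# alignment scores: match = 5, mismatch = -4, gap open = -2, gap extend = -0.5
# use a local alignment: for a global alignment the begin index (alignment[3]) is always 0
alignment = pairwise2.align.localms(your_gene_seq, your_genome_seq, 5, -4, -2, -0.5, one_alignment_only=True)[0]
begin = alignment[3]  # index into the gapped alignment strings, not into the genome
gene_start = begin - alignment[1][:begin].count("-") + 1  # 1-based genome position, as GFF expects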
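To show where the start coordinate comes from in a local alignment, here is a dependency-free sketch of Smith-Waterman. It is an illustration under assumptions, not what Biopython does internally: the function name local_align_start is made up, it uses a single linear gap penalty rather than the separate gap open/extend scores above, and it needs memory proportional to len(gene) * len(genome), so it is only realistic for a short gene against a pre-selected region, not a whole genome.

```python
def local_align_start(gene, genome, match=5, mismatch=-4, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (score, start, end), where start and end are the 1-based
    inclusive genome coordinates covered by the best local alignment,
    or None if nothing aligns with a positive score.
    """
    n, m = len(gene), len(genome)
    # score[i][j] = best score of an alignment ending at gene[i-1], genome[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if gene[i - 1] == genome[j - 1] else mismatch
            cell = max(
                0,                        # start a new alignment here
                score[i - 1][j - 1] + sub,  # align the two bases
                score[i - 1][j] + gap,      # gene base against a gap
                score[i][j - 1] + gap,      # genome base against a gap
            )
            score[i][j] = cell
            if cell > best:
                best, best_cell = cell, (i, j)
    if best == 0:
        return None

    # Trace back from the best cell until the score drops to zero.
    i, j = best_cell
    end = j  # the alignment ends at genome[j-1], i.e. 1-based position j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if gene[i - 1] == genome[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    # j genome bases precede the alignment, so it starts at 1-based j + 1
    return best, j + 1, end
```

For the reverse strand you would run the same thing on the reverse complement of the gene. Note that a local alignment can clip poorly matching ends of the gene, so check that the aligned span is about as long as the gene before trusting the coordinates.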

To update the GFF, see the GFF parsing page on the Biopython wiki:

https://biopython.org/wiki/GFF_Parsing
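If you only need to change the coordinates of one feature, plain Python is also enough, since GFF is a tab-separated text format with the start and end in columns 4 and 5 (1-based, inclusive). A minimal sketch, assuming GFF3-style attributes in column 9 (ID=...;Name=...); the function name update_gff_coordinates and the file names are hypothetical, and GFF2/GTF files use a different attribute syntax.

```python
def update_gff_coordinates(gff_lines, feature_id, new_start, new_end):
    """Return a copy of gff_lines with start/end replaced for the feature
    whose column-9 attributes contain ID=feature_id (GFF3 syntax assumed)."""
    updated = []
    for line in gff_lines:
        newline = "\n" if line.endswith("\n") else ""
        fields = line.rstrip("\n").split("\t")
        # Skip comments/directives and anything that is not a 9-column row
        if not line.startswith("#") and len(fields) == 9:
            attributes = dict(
                pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
            )
            if attributes.get("ID") == feature_id:
                fields[3], fields[4] = str(new_start), str(new_end)
                line = "\t".join(fields) + newline
        updated.append(line)
    return updated


# Example use (file names and the ID are placeholders):
# with open("annotation.gff") as handle:
#     lines = handle.readlines()
# with open("annotation.updated.gff", "w") as handle:
#     handle.writelines(update_gff_coordinates(lines, "gene1", 1234, 5678))
```

This only touches the row with that exact ID; child features such as mRNA, exon and CDS rows have their own coordinates and would need to be updated too.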
