
BioPython: How to Parse by “Locus” key in GenBank

I have a GenBank file containing a number of sequences. A second file, a TSV, contains the names of these sequences along with some other information about them; I read it in as a pandas DataFrame. I used the .sample function to randomly select a name from this data and assigned it to the variable n_name, as shown in the block of code below.

n = df_bp_pos_2.sample(n = 1)
n_value = n.iloc[:2]
n_name = n.iloc[:1]
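
One caveat with the snippet above: `n.iloc[:1]` slices rows, so it yields a one-row DataFrame rather than a plain string, and a DataFrame cannot be compared directly with a record name. A minimal sketch of pulling out scalars instead, assuming the name sits in the first column and the value in the second (the stand-in data and the column names `name` and `value` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for df_bp_pos_2; the real column layout is an assumption
df_bp_pos_2 = pd.DataFrame(
    {"name": ["SEQ_A", "SEQ_B", "SEQ_C"], "value": [10, 20, 30]}
)

n = df_bp_pos_2.sample(n=1)  # one-row DataFrame
n_name = n.iloc[0, 0]        # scalar from the first column of that row
n_value = n.iloc[0, 1]       # scalar from the second column of that row
```

If the columns are in a different order, selecting by label (for example `n["name"].iloc[0]`) avoids depending on position.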

n_name is equal to the LOCUS name in the GenBank file, including case. I am trying to parse the GenBank file and extract the sequence whose locus equals n_name. The GenBank file is named all.gb. I have:

from Bio import SeqIO
for seq_record in SeqIO.parse("all.gb", "genbank"):

But I am not sure what the next line or two should be in order to select a record by locus. Any ideas?

You could also use a list of locus tags instead of just one locus tag.

from Bio import SeqIO

locus_tags = ["b0001", "b0002"]  # Example list of locus tags
records = []

for record in SeqIO.parse('all.gb', 'genbank'):
    for feature in record.features:
        tag = feature.qualifiers.get('locus_tag')
        if tag and tag[0] in locus_tags:
            # Extract the feature's DNA sequence from the parent record.
            # For the protein instead, read feature.qualifiers['translation']
            # or translate the extracted sequence on the fly.
            records.append(feature.extract(record.seq))

You can find more about the extract method and the feature qualifiers in the Biopython Tutorial and Cookbook.
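
The loop above matches `locus_tag` qualifiers on individual features. If the goal is instead to match the LOCUS line of each entry, as the question describes, Biopython stores that name in `record.name`. A minimal sketch, assuming `n_name` is a plain string; `find_by_locus` and `load_by_locus` are hypothetical helper names, not part of Biopython:

```python
def find_by_locus(records, locus_name):
    """Return the first record whose LOCUS name equals locus_name, else None."""
    for record in records:
        # For GenBank input, Biopython puts the LOCUS identifier in record.name
        if record.name == locus_name:
            return record
    return None


def load_by_locus(path, locus_name):
    """Parse a GenBank file and return the record with the given LOCUS name."""
    # Imported here so find_by_locus can be used on any iterable of records
    from Bio import SeqIO

    return find_by_locus(SeqIO.parse(path, "genbank"), locus_name)
```

With these helpers, `load_by_locus("all.gb", n_name)` would return the matching SeqRecord, whose sequence is available as `.seq`, or None if no entry has that LOCUS name.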

