简体   繁体   English

限制Biopython NCBIWWW搜索中的命中数

[英]Limiting the number of hits in a Biopython NCBIWWW Search

I'm working on trying to automate some BLAST searches. 我正在尝试自动执行一些BLAST搜索。 I need to pick up only the top three results from the BLAST results, however the parameter hitlist_size doesn't seem to be limiting my searches to only three results. 我只需要从BLAST结果中选择前三个结果,但是参数hitlist_size似乎并不限制我的搜索只限于三个结果。 No matter what size I specify I still end up getting > 3 hits for most samples (in others I get three, although I'm not sure if this is simply coincidence) . 无论我指定的大小如何,大多数样品最终还是会获得> 3次命中(在其他样品中,我会得到3次命中,尽管我不确定这是否仅仅是巧合)。 Does anyone have any insight to this? 有人对此有见识吗?

import Bio
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
fasta_string = open("FASTA_Files/48_50.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string,hitlist_size=3)
    with open("XML_Files/48_50.xml", "w") as save_to:
        save_to.write(result_handle.read())
        result_handle.close()

The following FASTA sequence produces the expected number of hits for 46 but exceeds the expected number of hits for 47 : 以下FASTA序列产生46的预期命中数,但超过47的预期命中数:

>46
NNNNNNNNNNGNNNNNNTGCAGTCGNANNNNNNNNNNNNNNNNNAGCTTGCTGCTTCGCTGACGAGTGGCGGACGGGTGA
GTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTGGCTAATACCGCATAACGTCGCAAGACCAA
AGAGGGGGACCTTCGGGCCTCTTGCCATCAGATGTGCCCAGATGGGATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGG
CGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCA
GTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTAC
TTTCAGCGGGGAGGAAGGNGTTGTGGTTAATAACCGCAGCAATTGACGTTACCCGCANAANAAGCACCGGCTAACTCCGT
GCCAGCAGCCGCGGTAATACGGAGGGTGCANGCGTTAATCGNAATTACTGGNCGTAAAGCGCACGCAGGCGGTCTGTCAA
GTCTGATGTGAAATCCCCGGGCTCAACCTGGNAACTGCATTCNAAACTGGCAGGCTTGAGTCTTGNAGANGGGGGGTAGA
ATTCCAGGTGTANCGNTGAAATGCGTANAGATCTGGANGNAATACCGGTGGCNAANGCGGCCCCCTGGACAAAGACTGAC
GCTCANGTGCGAAAGCGTGGNGGAGCAAANAGGATTAGATACCCTGGTAGNCCNNCGCCNNANACGATGTCTACNTGGNA
GGNTTGTGNCCTTGAGGCGTGNCTNNCCNGNAGCTNAACGCGTTTAAGTANANCNNCCTGGGGCGAGCNACGGGCNGCNN
GGNTNAAAACNNNNNNTGNNNTTNNACGGGNGNNCCCCGCANNANCCGGCTGNNAGCATGNTGGATTNAANTTCGATNNN
NNCGCGAAGAANCCNNANNNNNGNNNNNNNANNNNNNNNNNAANNNTNNNNNANANNNNNNNNNNNNNNNGNNNNCTNNN
AGNANNGNNNCTGCATGGNNNNCNNNCNNGNTCNNGNNNNNNNNNNNNNGNNNNNNNNCCNNNANNNNNNNNNNNNTNAT
CNNNNNNNNNNNNNTNNNNNNNNNNNNNAGNNNNNNNNCNNNNNNNNNNANNTNNNAANN
>47
NNNNNNNNNNNNNNNNNNNNNNNNGNNNNNCGGGTNNCCNNNNGCTGGNTGNNNTGCTGACGAGTGGCGGACGGGTGAGT
AATGTCTGGGAAACTGCCTGAAGGGGGGGGATTCCTACTGGCCAGGGTGGCTAATACCGGGTAACGTCGNNNGANCAAAG
AGGGGGACCTTCGGGCCTCTTGCCATCACATGTGCCCGGATGGGATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGGCG
ACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCNCACTGGAACTGAGACACGGTCCCGACTCCTACGGGAGGCANCAGT
GGGGAATATTGCTCTTGGGCGCAAGCCTGATGCAGCCATGCCGCGGGTATGAGGAAGGCCTTCGGTTTGTAAAGTACTTT
CTCCGGGGAGGAAGGNGTNGTGGTGAATAACCGCTACANTTGANNCTNCCCGCNNAANAACCACCNGNTAACTCCNTGCN
NNNNGCCGCGGTAATACGGANGGTGCAAGNGTTAATCGNANTTACTGNNTGTTGAGCGCACGNNGGCGGCCTGTCNNNTC
TNATGTGAGATCCCCGGGCTCNCCCTGNNACCTGCATTCGNNNNNTGNNANGCTNGANTCTTGNNNNGNNGNGGNAGNAA
TTCCNNGTGTNNCGNNGAAATGCNNANAGATCTGNANANANNACNGGNGNCCAANGNNGNCCCCTGNTCTCNGACTGACG
CNNGAGTGCTGAANNGTGNAGAGCGNACAGGATTANANNNCCNGNTAGNCCGNCNCCNCACACCNATGTCTACATGNGAG
GNTNNNGNNNNNTGNGGCNNNNCNNTCCNNNAGCTNANGNNGTTNAANTANATCNNNCTNNNNCNNGCNNGGGGNCANNA
NGGGNNNAAANNTNNNNATNAATNTGACGGANNNNNCNNNNNNNNCNNNNNNANCATGNGGATTNANNNTNNNTNNNNNC
NNNNNANAACCNNANNNNNNNNNNNNNNTNNNNNNNANNNTNNNNNNNTNNNNNNGNNNNCNNNNNNNNACTNNNNNNNC
NNNNNNNNCANNGNNNNNNNNNNNANGNTNNNNNTGNNNNAANNNTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTN
NNNNNNTNNNNNNNTNNNNTNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNTCNNNNNN

The problem seems that a hit is something different than an alignment . 问题似乎是命中不同于对齐 A hit is a sequence match in the database, whereas an alignment is the actual position of the nucleotides. 命中是数据库中的序列匹配,而比是核苷酸的实际位置。 This could happen several times for the same sequence in the database. 对于数据库中的相同序列,可能会发生多次。

In the case you tried, the xml that this saves has indeed three items under <Hit> , but the last hit you get has seven <Hsp> records, which seem to be what you are iterating through with the loop: 在您尝试的情况下,此保存的xml实际上在<Hit>下有3个项目,但是您最后获得的命中有7个 <Hsp>记录,这似乎是您在循环中迭代的内容:

for rec in blast_records:
    for alignment in rec.alignments:
        for hsp in alignment.hsps:
            print(rec.query) #DNA ID

Specifying alignments=3 instead of hits will get you 3 alignments maximum per hit, I think. 我认为,指定alignments=3而不是匹配数,则每次匹配最多可获得3个对齐方式。 You can specify both, if you want to control number of hits and number of alignments. 如果要控制命中数和比对数,可以同时指定。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM