
Only some of my species are being converted to NCBI IDs when using Biopython to convert species names to IDs

I have some code that takes species names from a list, where the words are joined by underscores, and converts each name to a format suitable for an NCBI search. It then looks up the ID associated with that species name. For some reason this does not work for every entry in my input file. Below are my code, a subset of the input file, and a subset of the output file.

from Bio import Entrez
import time

# NCBI asks for a contact address on every E-utilities request.
Entrez.email = 'fake.email@isp.com'

def get_tax_id(species):
    """Return the list of NCBI Taxonomy IDs matching a species name."""
    # Biopython URL-encodes the term itself, so use plain spaces rather than '+'.
    species = species.replace('_', ' ').strip()
    search = Entrez.esearch(term=species, db='taxonomy', retmode='xml')
    record = Entrez.read(search)
    search.close()
    return record['IdList']

if __name__ == '__main__':
    # Avoid ':' and '#' in the file name; they are invalid or awkward on some systems.
    current_time = time.strftime("%d.%m.%y_%H.%M", time.localtime())
    output_name = 'test_%s.txt' % current_time

    # The species name is the first tab-separated column; skip blank lines.
    with open("OGTlist.csv") as infile:
        organisms = [line.split('\t')[0].strip() for line in infile if line.strip()]

    with open(output_name, "w") as outfile:
        for organism in organisms:
            taxids = get_tax_id(organism)
            if taxids:
                # Usually one ID; join with ';' in case several are returned.
                outfile.write(organism + ',' + ';'.join(taxids) + '\n')
            else:
                outfile.write(organism + ',ERROR_no_ID_match\n')

This code writes a results file. If the conversion works, it writes the species name and the ID; if not, it writes the species name and an error marker. My results file looks like this:

micromonospora_inyonensis,47866
viola_arvensis,97415
amycolatopsis_albidoflavus,102226
tetragenococcus_koreensis,290335
panaeolus_papilionaceus,330517
geomys_pinetis,100306
vibrio_lutjanus,ERROR_no_ID_match
succiniclasticum_ruminis,40841
microtetraspora_malaysiensis,161358
blarina_carolinensis,183658
amycolatopsis_palatopharyngis,187982
rhodosporidium_toruloides,5286
geobacter_bemidjiensis,225194
acinetobacter_haemolyticus,29430
actinoplanes_tereljensis,571912
phyllostomus_hastatus,9423
phacidium_infestans,66518
dorea_formicigenerans,39486
hoeflea_marina,274592
naemacyclus_minor,64355
methanosaeta_thermophila,2224
pholiota_carbonaria,227966
sphingomonas_faeni,185950
helicobacter_pullorum,35818
solitalea_koreensis,543615
dermacoccus_profundi,322602
pseudomonas_pictorum,86184
actinomadura_livida,79909
leptonycteris_curasoae,55054
psychrobacter_salsus,219741
vibrio_inusitatus,413402
stereum_rameale,ERROR_no_ID_match
photorhabdus_temperata,574560
clitocybe_lignatilis,5634
actinocorallia_glomerata,46203
aspergillus_giganteus,5060
erwinia_amylovora,552
hydrogenoanaerobacterium_saccharovorans,474960
mycobacterium_aichiense,1799
nocardia_pneumoniae,228601
bacillus_pocheonensis,363869
streptomonospora_alba,183763
exobasidium_gracile,190086
phenylobacterium_zucineum,284016
amsonia_tabernaemontana,144544
rattus_fuscipes,10119
jannaschia_rubra,282197
hereroa_rehneltiana,ERROR_no_ID_match

The file I'm getting the species names from looks like this:

micromonospora_inyonensis   28  DSMZ
viola_arvensis  23  DSMZ
amycolatopsis_albidoflavus  28  DSMZ
tetragenococcus_koreensis   28  DSMZ
panaeolus_papilionaceus 24  DSMZ
geomys_pinetis  36.3    white
vibrio_lutjanus 30  DSMZ
succiniclasticum_ruminis    37  DSMZ
microtetraspora_malaysiensis    28  DSMZ
blarina_carolinensis    36.8    white
amycolatopsis_palatopharyngis   28  DSMZ
rhodosporidium_toruloides   23  DSMZ
geobacter_bemidjiensis  30  DSMZ
acinetobacter_haemolyticus  28  DSMZ
actinoplanes_tereljensis    28  DSMZ
phyllostomus_hastatus   34.7    white
phacidium_infestans 25  DSMZ
dorea_formicigenerans   37  DSMZ
hoeflea_marina  28  DSMZ
naemacyclus_minor   22  DSMZ
methanosaeta_thermophila    58.3333333333   DSMZ
pholiota_carbonaria 25  DSMZ
sphingomonas_faeni  22  DSMZ
helicobacter_pullorum   37  DSMZ
solitalea_koreensis 28  DSMZ
dermacoccus_profundi    28  DSMZ
pseudomonas_pictorum    28  DSMZ
actinomadura_livida 28  DSMZ
leptonycteris_curasoae  35.7    white
psychrobacter_salsus    22  DSMZ
vibrio_inusitatus   28  DSMZ
stereum_rameale 20  DSMZ
photorhabdus_temperata  28.6666666667   DSMZ
clitocybe_lignatilis    25  DSMZ
actinocorallia_glomerata    28  DSMZ
aspergillus_giganteus   24.5    DSMZ
erwinia_amylovora   26.6666666667   DSMZ
hydrogenoanaerobacterium_saccharovorans 37  DSMZ
mycobacterium_aichiense 37  DSMZ
nocardia_pneumoniae 28  DSMZ
bacillus_pocheonensis   30  DSMZ
streptomonospora_alba   28  DSMZ
exobasidium_gracile 20  DSMZ
phenylobacterium_zucineum   30  DSMZ
amsonia_tabernaemontana 23  DSMZ
rattus_fuscipes 37.5    white
jannaschia_rubra    25  DSMZ
hereroa_rehneltiana 23  DSMZ

My actual input file has about 2000 entries. Is the answer as simple as the species names being incorrect, or IDs not existing on the NCBI for all of these species? Does anyone have a way to overcome this programmatically?

The first answer is that these species names do not exist at the NCBI. You can check that on the NCBI website, for example: https://www.ncbi.nlm.nih.gov/search/?term=Stereum+rameale

https://www.ncbi.nlm.nih.gov/search/?term=vibrio_lutjanus

Vibrio lutjanus does not seem to exist anyway if you look at other websites, for example https://www.arb-silva.de/search/

There is no way to overcome this when the goal is to find taxon IDs, but you can double-check whether the name is right. Taxonomy is difficult: different sources use different names, and there are many synonyms. You can use the APIs of taxonomic name websites such as GBIF or Global Names.
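As a rough illustration of such a double check, here is a minimal sketch that asks GBIF's species match service about one name. The endpoint URL and the response field names (matchType, status, and so on) are my assumptions about GBIF's public API, and the helper names are made up for this example, so check them against the GBIF documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed GBIF species match endpoint; verify against the GBIF API docs.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(raw_name):
    """Turn 'stereum_rameale' into a match URL for 'Stereum rameale'."""
    name = raw_name.replace("_", " ").strip().capitalize()
    return GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": name})


def summarize_match(payload):
    """Keep the fields that help judge a name; missing ones become None.

    The field names are assumptions about GBIF's JSON response.
    """
    keys = ("matchType", "status", "scientificName", "canonicalName", "rank")
    return {key: payload.get(key) for key in keys}


def check_name(raw_name, timeout=30):
    """Query GBIF for a single name (requires network access)."""
    url = build_match_url(raw_name)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_match(json.load(response))


if __name__ == "__main__":
    # Names that got ERROR_no_ID_match in the results file above.
    for name in ("vibrio_lutjanus", "stereum_rameale", "hereroa_rehneltiana"):
        print(name, check_name(name))
```

If the service reports a synonym or a different accepted name, that name can be tried against the NCBI instead of the original one.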

[EDIT]

If the species is not available, you can also check the taxon ID of the genus. The NCBI taxonomy data can be downloaded here:

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/

You need to download the zip file; you will probably need the files rankedlineage.dmp and merged.dmp. The Global Names website can also be used at genus level. I don't know whether Entrez from Biopython can look up IDs at genus level; if it can, that may also be an option.
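A minimal sketch of an offline lookup with a genus fallback, assuming rankedlineage.dmp has been unpacked locally and that its first two fields are the tax ID and the scientific name, with fields separated by tab, pipe, tab. The function names are made up for this example, and merged.dmp (which maps old IDs to new ones) is not used here.

```python
def parse_dmp_line(line):
    """Split one taxdump line into fields.

    Assumes fields are separated by tab-pipe-tab and the line ends in tab-pipe.
    """
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def load_name_index(lines):
    """Map lower-cased scientific name -> tax ID from rankedlineage.dmp lines."""
    index = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            # If the same name occurs twice, the later line wins.
            index[fields[1].lower()] = fields[0]
    return index


def lookup_with_genus_fallback(raw_name, index):
    """Return (tax_id, level); level is 'species', 'genus', or None."""
    name = raw_name.replace("_", " ").strip().lower()
    if name in index:
        return index[name], "species"
    genus = name.split(" ")[0]
    if genus in index:
        return index[genus], "genus"
    return None, None


# Usage, assuming the file is in the working directory:
# with open("rankedlineage.dmp", encoding="utf-8") as handle:
#     index = load_name_index(handle)
# print(lookup_with_genus_fallback("vibrio_lutjanus", index))
```

The second element of the result tells you whether the ID belongs to the species itself or only to its genus, so genus-level matches can be flagged in the output file rather than mixed in silently.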
