I have a.fasta file (.txt essentiallly) of about 145000 entries that are formatted as below
>gi|393182|gb|AAA40101.1| cytokine [Mus musculus]
MDAKVVAVLALVLAALCISDGKPVSLSYRCPCRFFESHIARANVKHLKILNTPNCALQIVARLKNNNRQV
CIDPKLKWIQEYLEKALNKRLKM
>gi|378792467|pdb|3UNH|Y Chain Y, Mouse 20s Immunoproteasome
TTTLAFKFQHGVIVAVDSRATAGSYISSLRMNKVIEINPYLLGTMSGCAADCQYWERLLAKECRLYYLRN
GERISVSAASKLLSNMMLQYRGMGLSMGSMICGWDKKGPGLYYVDDNGTRLSGQMFSTGSGNTYAYGVMD
SGYRQDLSPEEAYDLGRRAIAYATHRDNYSGGVVNMYHMKEDGWVKVESSDVSDLLYKYGEAAL
>gi|378792462|pdb|3UNH|T Chain T, Mouse 20s Immunoproteasome
MSSIGTGYDLSASTFSPDGRVFQVEYAMKAVENSSTAIGIRCKDGVVFGVEKLVLSKLYEEGSNKRLFNV
DRHVGMAVAGLLADARSLADIAREEASNFRSNFGYNIPLKHLADRVAMYVHAYTLYSAVRPFGCSFMLGS
YSANDGAQLYMIDPSGVSYGYWGCAIGKARQAAKTEIEKLQMKEMTCRDVVKEVAKIIYIVHDEVKDKAF
ELELSWVGELTKGRHEIVPKDIREEAEKYAKESLKEEDESDDDNM
I have been using various BioPython parsing bits and pieces but I think because of the size of the search it fails. I was hoping someone on here would know of a more efficient way?
Thanks in advance!
Rather than parsing the not entirely consistent FASTA header line for the species, you could just extract the GI number and then look up the NCBI taxonomy ID, eg see http://lists.open-bio.org/pipermail/biopython/2009-June/005304.html - and from the taxid you can get the species name, common name, lineage etc. See ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt or if you prefer an online solution, the Entrez Utilities (EUtils) are another option.
Without having the data to try it out on, I'd suggest the quickest way would be to load the gi
you want into a set. Then read through the fasta file with the minimum amount of processing to extract the gi
in the line and then if the gi is in the wanted set, extract the last |
delimited field.
Eg:
# assuming gi list is in a file, one per line
with open('lookup_list.txt','r') as f:
wanted = set(x.strip() for x in f)
with open('data.fasta','r') as f:
for line in f:
if line and line[0] == '>':
gi = line[4:line.find('|',4)]
if gi in wanted:
text = line[line.rfind('|')+1:] # Now process the text to extract species
print text
How you extract the species name from the description field will depend on the various formats you find.
Really, really naive approach
with open('my.fasta') as fd:
for line in fd:
if line.startswith('>'):
if '[' in line:
gi=line.split('|')[1]
name=line.split('[')[-1]
print gi, name[:-2]
For the above, the output is:
393182 Mus musculus
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.