
Parsing a specific string from the header of a FASTA file

I want to extract the organism name from the headers of a FASTA file. The name appears in the description field as OS=(Organism Name).

FASTA header
>sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1
MPICEFSATSKSRKIDVHAHVLPKNIPDFQEKFGYPGFVRLDHKEDGTTHMVKDGKLFRV
VEPNCFDTETRIADMNRANVNVQCLSTVPVMFSYWAKPADTEIVARFVNDDLLAECQKFP
GKEHIVLGTDYPFPLGEL
EVGRVVEEYKPFSAKDREDLLWKNAVKMLDIDENLLFNKDF
>sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2
MNSLLRLSHLAGPAHYRALHSSSSIWSKVAISKFEPKSYLPYEKLSQTVKIVKDRLKRPL
TLSEKILYGHLDQPKTQDIERGVSYLRLRPDRVAMQDATAQMAMLQFISSGLPKTAVPST
IHCDHLIEAQKGGAQDLARAKDLNKEVFNFLATAGSKYGVGFWKPGSGIIHQIILENYAF
Code for obtaining the FASTA header
Caenorhabditis elegans
Caenorhabditis elegans

Current output:

 >sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1
 >sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2

Desired output:

 Caenorhabditis elegans
 Caenorhabditis elegans

You can find the organism name with a regex search:

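import re

example = "sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2"

# Search for "OS=" rather than just "OS", which could also occur in the ID or protein name
start = re.search("OS=", example).end()
# Keep everything up to the "GN=" field and strip the surrounding whitespace
result = example[start:].split("GN=")[0].strip()
print(result)  # Caenorhabditis elegans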

This code takes the text after "OS=" up to "GN" and strips the surrounding whitespace.
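
The same idea can be extended to every header in a file. The following is only a sketch, not part of the original answer: it assumes the headers use the "XX=value" field layout shown in the question, and the names `OS_PATTERN` and `organism_names` are made up for illustration. Instead of relying on "GN" always following the organism, it stops at whichever two-letter field comes next, or at the end of the line.

```python
import re

# Assumption: header fields look like "OS=...", "GN=...", "PE=...", i.e. two
# uppercase letters followed by "=", as in the headers from the question.
OS_PATTERN = re.compile(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|$)")

def organism_names(fasta_text):
    """Return the OS= value of each header line (None if a header has no OS= field)."""
    names = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        match = OS_PATTERN.search(line)
        names.append(match.group(1).strip() if match else None)
    return names

sample = """>sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1
MPICEFSATSKSRKIDVHAHVLPKNIPDFQEKFGYPGFVRLDHKEDGTTHMVKDGKLFRV
>sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2
MNSLLRLSHLAGPAHYRALHSSSSIWSKVAISKFEPKSYLPYEKLSQTVKIVKDRLKRPL
"""

for name in organism_names(sample):
    print(name)
```

Returning None for a header without an OS= field keeps the result aligned with the order of the records, so a missing organism is easy to spot.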
