How do I convert the three letter amino acid codes to one letter code with python or R?

I have a FASTA file as shown below. I would like to convert the three-letter codes to one-letter codes. How can I do this with Python or R?

>2ppo
ARGHISLEULEULYS
>3oot
METHISARGARGMET

Desired output:

>2ppo
RHLLK
>3oot
MHRRM

Your suggestions would be appreciated!

Biopython already has built-in dictionaries to help with such translations. The following commands show the full list of available dictionaries:

import Bio.SeqUtils
help(Bio.SeqUtils.IUPACData)

The predefined dictionary you are looking for:

Bio.SeqUtils.IUPACData.protein_letters_3to1['Ala']
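
The keys of that dictionary are capitalised ('Ala', 'Arg', ...), so an upper-case sequence like the one in the question needs each triplet normalised before the lookup. A minimal sketch, assuming a dictionary shaped like protein_letters_3to1; the small `table` below is only a stand-in so the snippet runs without Biopython, and `to_one_letter` is a hypothetical helper name:

```python
# Stand-in for Bio.SeqUtils.IUPACData.protein_letters_3to1 (assumption: same
# key style, i.e. capitalised three-letter codes). Only a few entries shown.
table = {'Arg': 'R', 'His': 'H', 'Leu': 'L', 'Lys': 'K', 'Met': 'M'}

def to_one_letter(seq, table):
    """Convert a string of three-letter codes using a title-case keyed table."""
    if len(seq) % 3 != 0:
        raise ValueError('Input length should be a multiple of three')
    # str.capitalize() turns 'ARG' or 'arg' into 'Arg' to match the keys.
    return ''.join(table[seq[i:i + 3].capitalize()]
                   for i in range(0, len(seq), 3))

print(to_one_letter('ARGHISLEULEULYS', table))  # RHLLK
```

With Biopython installed, the real dictionary could be passed in place of `table`.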

Use a dictionary to look up the one letter codes:

d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
     'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}

And a simple function that maps the three-letter codes to one-letter codes for the entire string:

def shorten(x):
    if len(x) % 3 != 0:
        raise ValueError('Input length should be a multiple of three')

    y = ''
    # Integer division (//) keeps this working on Python 3 as well as Python 2.
    for i in range(len(x) // 3):
        y += d[x[3*i:3*i+3]]
    return y

Testing your example:

>>> shorten('ARGHISLEULEULYS')
'RHLLK'
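
Since the question starts from a FASTA file, the same lookup can be applied line by line while leaving the header lines untouched. This is only a sketch under assumptions: `convert_fasta` is a hypothetical helper, the records are given as a list of lines rather than read from disk, and every sequence line is assumed to hold a whole number of three-letter codes (a line wrapped in the middle of a code would raise an error).

```python
d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
     'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}

def shorten(x):
    """Translate a string of upper-case three-letter codes."""
    if len(x) % 3 != 0:
        raise ValueError('Input length should be a multiple of three')
    return ''.join(d[x[i:i + 3]] for i in range(0, len(x), 3))

def convert_fasta(lines):
    """Yield FASTA lines with each sequence line converted to one-letter codes."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            yield line                    # header: keep as-is
        else:
            yield shorten(line.upper())   # sequence: translate

records = ['>2ppo', 'ARGHISLEULEULYS', '>3oot', 'METHISARGARGMET']
print('\n'.join(convert_fasta(records)))
```

For a real file, an open file handle can be passed instead of the list, since the function only iterates over lines.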

Here is a way to do it in R:

# Variables:
foo <- c("ARGHISLEULEULYS","METHISARGARGMET")

# Code maps:
code3 <- c("Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His", 
"Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", 
"Tyr", "Val")
code1 <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", 
"M", "F", "P", "S", "T", "W", "Y", "V")

# For each code replace 3letter code by 1letter code:
for (i in 1:length(code3))
{
    foo <- gsub(code3[i],code1[i],foo,ignore.case=TRUE)
}

Results in:

> foo
[1] "RHLLK" "MHRRM"

Note that I changed the variable name, as variable names are not allowed to start with a number in R.
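
One caveat with replacing codes anywhere in the string: the replacement does not know where one residue ends and the next begins, so a code can be matched across a residue boundary. The sketch below is a Python re-creation of the same idea (sequential global replacement in the same code order), not the R code itself; `replace_everywhere` and `replace_in_frame` are hypothetical names. It works for the question's example but not for Trp-His-Glu, where the intermediate string "TRPHE" contains "PHE".

```python
# Same order as code3/code1 in the R answer, upper-cased.
codes = [('ALA', 'A'), ('ARG', 'R'), ('ASN', 'N'), ('ASP', 'D'), ('CYS', 'C'),
         ('GLU', 'E'), ('GLN', 'Q'), ('GLY', 'G'), ('HIS', 'H'), ('ILE', 'I'),
         ('LEU', 'L'), ('LYS', 'K'), ('MET', 'M'), ('PHE', 'F'), ('PRO', 'P'),
         ('SER', 'S'), ('THR', 'T'), ('TRP', 'W'), ('TYR', 'Y'), ('VAL', 'V')]

def replace_everywhere(seq):
    """Replace each code wherever it occurs, one code after another."""
    for three, one in codes:
        seq = seq.replace(three, one)
    return seq

def replace_in_frame(seq):
    """Translate strictly in steps of three characters."""
    table = dict(codes)
    return ''.join(table[seq[i:i + 3]] for i in range(0, len(seq), 3))

print(replace_everywhere('ARGHISLEULEULYS'))  # RHLLK
print(replace_everywhere('TRPHISGLU'))        # TRF (wrong)
print(replace_in_frame('TRPHISGLU'))          # WHE
```

For short inputs like the ones in the question this makes no difference, but for longer sequences an approach that walks the string three characters at a time is safer.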

>>> src = "ARGHISLEULEULYS"
>>> trans = {'ARG':'R', 'HIS':'H', 'LEU':'L', 'LYS':'K'}
>>> "".join(trans[src[x:x+3]] for x in range(0, len(src), 3))
'RHLLK'

You just need to add the rest of the entries to the trans dict.

Edit:

To build the rest of trans, you can do this. File table:

Ala A
Arg R
Asn N
Asp D
Cys C
Glu E
Gln Q
Gly G
His H
Ile I
Leu L
Lys K
Met M
Phe F
Pro P
Ser S
Thr T
Trp W
Tyr Y
Val V

Read it:

trans = dict((l.upper(), s) for l, s in
             [row.strip().split() for row in open("table").readlines()])
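
A slightly more defensive variant of the same idea, as a sketch: it skips blank lines, rejects malformed rows, and takes any iterable of lines, so the file can be opened with a `with` block and closed reliably. `load_table` is a hypothetical name, and `io.StringIO` stands in for the real file only so the snippet is self-contained.

```python
import io

def load_table(handle):
    """Build {'ALA': 'A', ...} from lines of the form 'Ala A'."""
    trans = {}
    for row in handle:
        fields = row.split()
        if not fields:
            continue                       # ignore blank lines
        if len(fields) != 2:
            raise ValueError('Expected two columns, got: %r' % row)
        three, one = fields
        trans[three.upper()] = one         # upper-case keys, as in trans above
    return trans

# Stand-in for the file "table" (only a few rows shown).
table_text = "Ala A\nArg R\nHis H\nLeu L\nLys K\n"
trans = load_table(io.StringIO(table_text))

src = "ARGHISLEULEULYS"
print("".join(trans[src[x:x+3]] for x in range(0, len(src), 3)))  # RHLLK
```

With the real file this would be called as `with open("table") as handle: trans = load_table(handle)`.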

You may want to look into installing Biopython, since you are parsing a .fasta file and then converting to one-letter codes. At the time of writing, Biopython only had the function seq3 (in Bio.SeqUtils), which does the inverse of what you want. Example output in IDLE:

>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'

There was no seq1 function at the time (later Biopython releases added Bio.SeqUtils.seq1), but I thought seq3 might be helpful to you in the future. As for your problem, Junuxx is correct: create a dictionary and use a for loop to read the string in blocks of three and translate. Here is a function similar to the one he provided; it is more inclusive (it also maps TER and XAA) and handles lower case as well.

def AAcode_3_to_1(seq):
    '''Turn a three letter protein into a one letter protein.

    The 3 letter code can be upper, lower, or any mix of cases.
    The seq input length should be a multiple of 3, or else a
    ValueError is raised.

    >>> AAcode_3_to_1('METHISARGARGMET')
    'MHRRM'

    '''
    d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
         'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
         'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 'TER': '*',
         'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M', 'XAA': 'X'}

    if len(seq) % 3 != 0:
        raise ValueError("Sequence length is not a multiple of 3!")

    upper_seq = seq.upper()
    single_seq = ''
    # Integer division (//) is required on Python 3.
    for i in range(len(upper_seq) // 3):
        single_seq += d[upper_seq[3*i:3*i+3]]
    return single_seq

Biopython has a nice solution:

>>> from Bio.PDB.Polypeptide import *
>>> three_to_one('ALA')
'A'

For your example, I would solve it with this one-liner:

>>> from Bio.PDB.Polypeptide import *
>>> str3aa = 'ARGHISLEULEULYS'
>>> "".join([three_to_one(aa3) for aa3 in [ "".join(g) for g in zip(*(iter(str3aa),) * 3)]])
'RHLLK'

They may criticize me for this type of one-liner :), but deep in my heart I am still in love with Perl.
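
The grouping part of that one-liner, `zip(*(iter(str3aa),) * 3)`, may be easier to follow when unpacked. The sketch below uses only the standard library (no Biopython); `triplets` is a hypothetical helper name. Passing the same iterator to zip three times makes each output tuple consume three consecutive characters.

```python
def triplets(s):
    """Split a string into consecutive groups of three characters."""
    it = iter(s)
    # zip pulls one item from each argument per tuple; because all three
    # arguments are the same iterator, each tuple is three consecutive chars.
    return [''.join(group) for group in zip(it, it, it)]

print(triplets('ARGHISLEULEULYS'))  # ['ARG', 'HIS', 'LEU', 'LEU', 'LYS']
print(triplets('ARGHI'))            # ['ARG']
```

Note the second call: zip stops at the first exhausted argument, so an incomplete trailing group is silently dropped instead of raising an error. If truncated input matters, check the length first.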

Using R:

convert <- function(l) {

  map <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
           "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

  names(map) <- c("ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN",
                  "GLY", "HIS", "ILE", "LEU", "LYS", "MET", "PHE",
                  "PRO", "SER", "THR", "TRP", "TYR", "VAL")

  sapply(strsplit(l, "(?<=[A-Z]{3})", perl = TRUE),
         function(x) paste(map[x], collapse = ""))
}

convert(c("ARGHISLEULEULYS", "METHISARGARGMET"))
# [1] "RHLLK" "MHRRM"

Another way to do it is with the seqinr and iPAC packages in R.

# install.packages("seqinr")
# source("https://bioconductor.org/biocLite.R")
# biocLite("iPAC")

library(seqinr)
library(iPAC)

#read in file
fasta = read.fasta(file = "test_fasta.fasta", seqtype = "AA", as.string = T, set.attributes = F)
#split string
n = 3
fasta1 = lapply(fasta, function(x) substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n)))
#convert the three letter code for each element in the list 
fasta2 = lapply(fasta1, function(x) paste(sapply(x, get.SingleLetterCode), collapse = ""))

# > fasta2
# $`2ppo`
# [1] "RHLLK"
#
# $`3oot`
# [1] "MHRRM"

Use this Perl script to convert three-letter amino acid codes to single-letter codes. It reads one three-letter code per line (capitalised as in the hash keys, e.g. Ala) and prints the code and its one-letter equivalent separated by a tab:

my %aa_hash=(
  Ala=>'A',
  Arg=>'R',
  Asn=>'N',
  Asp=>'D',
  Cys=>'C',
  Glu=>'E',
  Gln=>'Q',
  Gly=>'G',
  His=>'H',
  Ile=>'I',
  Leu=>'L',
  Lys=>'K',
  Met=>'M',
  Phe=>'F',
  Pro=>'P',
  Ser=>'S',
  Thr=>'T',
  Trp=>'W',
  Tyr=>'Y',
  Val=>'V',
  Sec=>'U',   # Non-standard: selenocysteine (Sec) and pyrrolysine (Pyl); see http://www.uniprot.org/manual/non_std
  Pyl=>'O',
);

while(<>){
    chomp;
    my $aa=$_;
    warn "ERROR!! $aa invalid or not found in hash\n" if !$aa_hash{$aa};
    print "$aa\t$aa_hash{$aa}\n";
}

For those who land here in 2017 and beyond:

Here's a single-line Linux bash command to convert three-letter amino acid codes to single-letter codes in a text file. I know this is not very elegant, but I hope it helps anyone who is searching for the same thing and wants a single-line command.

sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/HID/H/g;s/HIE/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g;s/MSE/X/g' < input_file_three_letter_code.txt > output_file_single_letter_code.txt

Solution for the original question above, as a single command line:

sed 's/.\{3\}/& /g' < input_file_three_letter_code.txt | sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/HID/H/g;s/HIE/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g;s/MSE/X/g' | sed 's/ //g' > output_file_single_letter_code.txt

Explanation:

[1] sed 's/.\{3\}/& /g' splits the sequence: it adds a space after every third character.

[2] The second sed command in the pipe takes the output of the first and converts it to single-letter codes. Add any non-standard residue to this command as s/XYZ/X/g;.

[3] The third sed command, sed 's/ //g', removes the white space.
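
For comparison, the three steps above (split into groups of three, translate each group, join again) can be sketched in Python with a single regular-expression substitution. This is an illustrative equivalent under assumptions, not part of the sed answer: `convert_line` is a hypothetical name, the table copies the codes from the sed command, header lines starting with '>' are passed through, and unknown triplets are left unchanged, as sed would leave them.

```python
import re

# Codes taken from the sed command above (including HID/HIE and MSE).
codes = {'ALA': 'A', 'CYS': 'C', 'ASP': 'D', 'GLU': 'E', 'PHE': 'F',
         'GLY': 'G', 'HIS': 'H', 'HID': 'H', 'HIE': 'H', 'ILE': 'I',
         'LYS': 'K', 'LEU': 'L', 'MET': 'M', 'ASN': 'N', 'PRO': 'P',
         'GLN': 'Q', 'ARG': 'R', 'SER': 'S', 'THR': 'T', 'VAL': 'V',
         'TRP': 'W', 'TYR': 'Y', 'MSE': 'X'}

def convert_line(line):
    """Translate one line of upper-case three-letter codes."""
    if line.startswith('>'):
        return line  # FASTA header: leave untouched
    # re.sub scans left to right without overlapping, so on a line made only
    # of upper-case letters the matches fall on consecutive triplets.
    return re.sub(r'[A-Z]{3}',
                  lambda m: codes.get(m.group(), m.group()),
                  line)

for line in ['>2ppo', 'ARGHISLEULEULYS', '>3oot', 'METHISARGARGMET']:
    print(convert_line(line))
```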

Python 3 solutions.

In my work, the annoying part is that the amino acid codes can refer to modified residues, which often appear in PDB/mmCIF files, like

'Tih' --> 'A'.

So the mapping can have more than 22 pairs. Third-party tools in Python like

Bio.SeqUtils.IUPACData.protein_letters_3to1

cannot handle it. My easiest solution is to use http://www.ebi.ac.uk/pdbe-srv/pdbechem to find the mapping, and to add each unusual mapping to the dict in my own function whenever I encounter one.

def three_to_one(three_letter_code):
    mapping = {'Aba':'A','Ace':'X','Acr':'X','Ala':'A','Aly':'K','Arg':'R','Asn':'N','Asp':'D','Cas':'C',
           'Ccs':'C','Cme':'C','Csd':'C','Cso':'C','Csx':'C','Cys':'C','Dal':'A','Dbb':'T','Dbu':'T',
           'Dha':'S','Gln':'Q','Glu':'E','Gly':'G','Glz':'G','His':'H','Hse':'S','Ile':'I','Leu':'L',
           'Llp':'K','Lys':'K','Men':'N','Met':'M','Mly':'K','Mse':'M','Nh2':'X','Nle':'L','Ocs':'C',
           'Pca':'E','Phe':'F','Pro':'P','Ptr':'Y','Sep':'S','Ser':'S','Thr':'T','Tih':'A','Tpo':'T',
           'Trp':'W','Tyr':'Y','Unk':'X','Val':'V','Ycm':'C','Sec':'U','Pyl':'O'} # you can add more
    return mapping[three_letter_code[0].upper() + three_letter_code[1:].lower()]
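
One way to keep the growing list of unusual codes manageable is to hold the standard table and the extra entries separately and merge them on demand. The following is a sketch of that design, not the function above: `build_mapping`, `lookup` and the `unknown` parameter are hypothetical, and the `MODIFIED` entries are a few pairs copied from the mapping above.

```python
STANDARD = {'Ala': 'A', 'Arg': 'R', 'Asn': 'N', 'Asp': 'D', 'Cys': 'C',
            'Gln': 'Q', 'Glu': 'E', 'Gly': 'G', 'His': 'H', 'Ile': 'I',
            'Leu': 'L', 'Lys': 'K', 'Met': 'M', 'Phe': 'F', 'Pro': 'P',
            'Ser': 'S', 'Thr': 'T', 'Trp': 'W', 'Tyr': 'Y', 'Val': 'V'}

# A few modified residues, copied from the mapping in this answer.
MODIFIED = {'Mse': 'M', 'Sep': 'S', 'Tpo': 'T', 'Ptr': 'Y', 'Tih': 'A'}

def build_mapping(*extras):
    """Return the standard table with any number of extra dicts layered on top."""
    mapping = dict(STANDARD)
    for extra in extras:
        mapping.update(extra)
    return mapping

def lookup(code, mapping, unknown=None):
    """Look up a code in any letter case; fall back to `unknown` if given."""
    key = code.capitalize()          # 'MSE' or 'mse' -> 'Mse'
    if key in mapping:
        return mapping[key]
    if unknown is None:
        raise KeyError('Unknown residue code: ' + code)
    return unknown

mapping = build_mapping(MODIFIED)
print(lookup('MSE', mapping))               # M
print(lookup('tih', mapping))               # A
print(lookup('ZZZ', mapping, unknown='X'))  # X
```

Whether an unrecognised residue should raise an error or become 'X' depends on the analysis, which is why the fallback is left as an explicit argument here.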

The other solution is to retrieve the mapping online (but the URL and the HTML pattern may change over time):

import re
import urllib.request

def three_to_one_online(three_letter_code):
    url = "http://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/" + three_letter_code
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')
    # Raw string so that \s reaches the regex engine unescaped.
    match = re.search(r'\s*<td\s*>\s*<h3>One-letter code.*</h3>\s*</td>\s*<td>\s*([A-Z])\s*</td>', html)
    if match is None:
        raise ValueError('No one-letter code found for ' + three_letter_code)
    return match.group(1)

Here I use re directly instead of an HTML parser, for simplicity.
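
Because a sequence repeats the same residues many times, an online lookup is usually worth caching so each code is fetched only once. A minimal sketch using functools.lru_cache from the standard library; `make_cached_lookup` is a hypothetical wrapper, and `fake_fetch` is a stand-in for the online function so that the example runs without network access.

```python
from functools import lru_cache

def make_cached_lookup(fetch):
    """Wrap a one-argument fetch function so each code is fetched only once."""
    @lru_cache(maxsize=None)
    def cached(code):
        return fetch(code)

    def lookup(code):
        # Normalise case first so 'Ala' and 'ALA' share one cache entry.
        return cached(code.upper())

    return lookup

calls = []

def fake_fetch(code):
    """Stand-in for an online lookup; records every real call."""
    calls.append(code)
    return {'ALA': 'A', 'MSE': 'M'}[code]

lookup = make_cached_lookup(fake_fetch)
print(lookup('Ala'), lookup('ALA'), lookup('mse'))  # A A M
print(calls)                                        # ['ALA', 'MSE']
```

In real use, the online function would be passed in place of `fake_fetch`.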

Hope this helps.

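Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you republish them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.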
