OneHotEncoding Protein Sequences

I have a dataframe of sequences (shown below) and am trying to one-hot encode them and store the result in a new dataframe. I am using the following code, but I cannot store the result, because I get the following output afterwards:

Code:

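import numpy as np
from sklearn.preprocessing import OneHotEncoder

# x_train is my dataframe; it has a 'sequence' column of protein strings
onehot_encoder = OneHotEncoder()
sequence = np.array(list(x_train['sequence'])).reshape(-1, 1)
encoded_sequence = onehot_encoder.fit_transform(sequence).toarray()
encoded_sequence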

[Image: output of the code above; not available]

but I get this error:

ValueError: Wrong number of items passed 12755, placement implies 1

You get that strange array because the encoder treats every whole sequence as a single category and one-hot encodes that. An example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'sequence':['AQAVPW','AMAVLT','LDTGIN']})

enc = OneHotEncoder()
seq = np.array(df['sequence']).reshape(-1,1)
encoded = enc.fit(seq)
encoded.transform(seq).toarray()

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

encoded.categories_

[array(['AMAVLT', 'AQAVPW', 'LDTGIN'], dtype=object)]

Since your entries are unique, every sequence becomes its own category: each row has a single 1 and zeros everywhere else, so with many sequences the matrix is almost entirely zeros. You can see this more easily with pd.get_dummies:

pd.get_dummies(df['sequence'])

   AMAVLT  AQAVPW  LDTGIN
0       0       1       0
1       1       0       0
2       0       0       1
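On the storing part of the question: the assignment line is not shown, so this is an assumption, but the ValueError is consistent with putting a two-dimensional array into a single dataframe column. A minimal sketch of one way around it is to wrap the array in its own DataFrame and concatenate. The names encoded_df and onehot_i are made up for illustration, and np.eye(3) only stands in for the encoder output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sequence": ["AQAVPW", "AMAVLT", "LDTGIN"]})

# Stand-in for the 2-D result of OneHotEncoder(...).toarray():
# one row per sequence, one column per category.
encoded_sequence = np.eye(3)

# Give every encoded column its own (hypothetical) name and reuse df's index
# so the rows line up when concatenating.
encoded_df = pd.DataFrame(
    encoded_sequence,
    index=df.index,
    columns=[f"onehot_{i}" for i in range(encoded_sequence.shape[1])],
)

# Store the encoding next to the original sequences in a new dataframe.
result = pd.concat([df, encoded_df], axis=1)
print(result)
```

This only fixes the storage step; the encoding itself is still one column per whole sequence, which is rarely useful as a feature.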

There are two ways to do this. One is to simply count how often each amino acid occurs and use the counts as predictors. I hope I have the amino acids right (school was a long time ago):

# requires Biopython
from Bio.SeqUtils.ProtParam import ProteinAnalysis

pd.DataFrame([ProteinAnalysis(i).count_amino_acids() for i in df['sequence']])

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
0   2   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0   0   1   1   0
1   2   0   0   0   0   0   0   0   0   1   1   0   0   0   0   0   1   1   0   0
2   0   0   1   0   0   1   0   1   0   1   0   1   0   0   0   0   1   0   0   0
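If Biopython is not installed, the same kind of count table can be sketched with the standard library. This is not what ProteinAnalysis does internally, just a stand-in under the assumption that only the 20 standard one-letter codes matter; AMINO_ACIDS and count_amino_acids below are illustrative names.

```python
from collections import Counter

import pandas as pd

# The 20 standard amino acids, in the same column order as the table above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_amino_acids(seq, alphabet=AMINO_ACIDS):
    """Return a dict of counts for every letter in alphabet (0 if absent)."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) for aa in alphabet}


df = pd.DataFrame({"sequence": ["AQAVPW", "AMAVLT", "LDTGIN"]})

# One row per sequence, one column per amino acid.
counts = pd.DataFrame([count_amino_acids(s) for s in df["sequence"]])
print(counts)
```

Letters outside the alphabet (for example X or B) are silently ignored here, so check your data for them first.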

The other is to split each sequence into single characters and encode by position. This requires all sequences to be the same length, and you need enough memory for the result:

byposition = df['sequence'].apply(lambda x:pd.Series(list(x)))
byposition

    0   1   2   3   4   5
0   A   Q   A   V   P   W
1   A   M   A   V   L   T
2   L   D   T   G   I   N

pd.get_dummies(byposition)

    0_A 0_L 1_D 1_M 1_Q 2_A 2_T 3_G 3_V 4_I 4_L 4_P 5_N 5_T 5_W
0   1   0   0   0   1   1   0   0   1   0   0   1   0   0   1
1   1   0   0   1   0   1   0   0   1   0   1   0   0   1   0
2   0   1   1   0   0   0   1   1   0   1   0   0   1   0   0
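When the sequences are not equally long, or when train and test data must end up with the same columns, one option is to pad to the longest sequence and encode against a fixed alphabet. The following is a rough sketch under those assumptions; onehot_by_position and the "-" padding symbol are made up for this example.

```python
import pandas as pd

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
PAD = "-"  # hypothetical padding symbol, deliberately not in the alphabet


def onehot_by_position(sequences, alphabet=AMINO_ACIDS, pad=PAD):
    """One-hot encode every position; shorter sequences are right-padded."""
    max_len = max(len(s) for s in sequences)
    # Column names follow the position_letter pattern that get_dummies produced.
    columns = [f"{pos}_{aa}" for pos in range(max_len) for aa in alphabet]
    rows = []
    for seq in sequences:
        padded = seq.ljust(max_len, pad)
        # Padded positions match no letter, so they stay all zeros.
        rows.append(
            [int(padded[pos] == aa) for pos in range(max_len) for aa in alphabet]
        )
    return pd.DataFrame(rows, columns=columns)


df = pd.DataFrame({"sequence": ["AQAVPW", "AMAVLT", "LDTG"]})
encoded = onehot_by_position(df["sequence"])

# Keep the original sequences next to their encoding in a new dataframe.
result = pd.concat([df, encoded], axis=1)
print(encoded.shape)
```

Because the alphabet is fixed, every position always gets 20 columns, so the width is 20 times the longest sequence. That costs more memory than get_dummies, which only creates columns for letters it actually sees.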
