Is there any way to convert a .csv file of SMILES strings to one-hot encoding?

I converted a single SMILES string to a one-hot encoding using the RDKit library, but when I try to convert an entire .csv file of SMILES strings I get an error.

Successful experiment:

 new = 'O=C(O)C1=C(N2N=CC=N2)C=CC(N)=N1'

   output :
   array([[0., 0., 0., ..., 0., 0., 0.],
   [0., 0., 0., ..., 0., 0., 0.],
   [0., 0., 0., ..., 0., 0., 0.],
   ...,
   [0., 0., 0., ..., 0., 0., 0.],
   [0., 0., 0., ..., 0., 0., 0.],
   [0., 0., 0., ..., 0., 0., 0.]])

But when I try multiple SMILES strings, I get this error:

   TypeError: No registered converter was able to produce a C++ rvalue of type class 
   std::basic_string<wchar_t,struct std::char_traits<wchar_t>,class std::allocator<wchar_t> > from 
    this Python object of type DataFrame

I am sharing my code file and a demo dataset:

Experimental code

Demo dataset

If anyone can help, please let me know.

Chem.MolToSmiles(Chem.MolFromSmiles(smiles)) can only convert one SMILES string at a time, but you passed it the whole dataframe. You have to loop over the SMILES strings in your dataframe.

This should work:

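import pandas as pd
from rdkit import Chem

# smiles_encoder and smiles_decoder are the functions from the question's code file.
df = pd.read_csv('RouteSynthesisPrediction_o2h.csv')

# Process one SMILES string per iteration instead of passing the whole dataframe.
for smi in df['Target']:
    smiles = Chem.CanonSmiles(smi)  # canonical form of the input SMILES
    mat = smiles_encoder(smiles)    # one-hot matrix
    dec = smiles_decoder(mat)       # decode back to a string as a check
    print(mat)
    print(smi)
    print(smiles)
    print(dec)
    print()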

Output:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
O=C(O)C1=C(N2N=CC=N2)C=CC(N)=N1
Nc1ccc(-n2nccn2)c(C(=O)O)n1
Nc1ccc(-n2nccn2)c(C(=O)O)n1

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
O=C(OC)C1=C(N2N=CC=N2)C=CC(N)=N1
COC(=O)c1nc(N)ccc1-n1nccn1
COC(=O)c1nc(N)ccc1-n1nccn1

.
.
.
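The question's own `smiles_encoder` is not shown, so here is a minimal sketch of the same per-row idea with a stand-in encoder. Everything in it is an assumption made for illustration: the `one_hot`, `build_vocab`, and `read_smiles` helpers are hypothetical, the vocabulary is built from the data at the character level (so a two-letter atom such as `Cl` would be split into two symbols), and the CSV is an in-memory string using the `Target` column name from the code above. It uses plain lists and skips the RDKit canonicalization step so that it runs with the standard library alone.

```python
import csv
import io

# Stand-in for the CSV file; only the 'Target' column name comes from the answer.
CSV_TEXT = """Target
O=C(O)C1=C(N2N=CC=N2)C=CC(N)=N1
O=C(OC)C1=C(N2N=CC=N2)C=CC(N)=N1
"""

def read_smiles(handle, column="Target"):
    """Return the SMILES strings from one column of a CSV file."""
    return [row[column] for row in csv.DictReader(handle)]

def build_vocab(smiles_list):
    """Map every character seen in the data to a column index."""
    chars = sorted(set("".join(smiles_list)))
    return {ch: i for i, ch in enumerate(chars)}

def one_hot(smiles, vocab):
    """Encode one SMILES string as a len(smiles) x len(vocab) matrix."""
    matrix = []
    for ch in smiles:
        row = [0.0] * len(vocab)
        row[vocab[ch]] = 1.0
        matrix.append(row)
    return matrix

smiles_list = read_smiles(io.StringIO(CSV_TEXT))
vocab = build_vocab(smiles_list)

# Encode one string at a time; the encoder never sees the whole column.
encoded = [one_hot(smi, vocab) for smi in smiles_list]

for smi, mat in zip(smiles_list, encoded):
    print(smi, len(mat), "x", len(mat[0]))
```

Each matrix has one row per character of its SMILES string, so molecules of different lengths give matrices with different row counts.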

There really isn't enough information here for a full answer. The output looks like a NumPy array, and a NumPy array has a fixed, rectangular shape that is allocated up front, so every row must have the same length. If the first row holds 10 floats, then a second row appended to it must also fit within 10 and cannot be longer. Otherwise NumPy is not able to allocate the memory for it.
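If the fixed-shape constraint described above is what you are hitting, one common workaround is to pad every per-molecule matrix with all-zero rows up to the longest one before stacking. The following is a rough sketch under that assumption; `pad_matrix` is a hypothetical helper and the two toy matrices are made up for illustration, not taken from the question's data.

```python
def pad_matrix(matrix, target_len, width):
    """Append all-zero rows until the matrix has target_len rows."""
    padding = [[0.0] * width for _ in range(target_len - len(matrix))]
    return matrix + padding

# Two toy one-hot matrices of different lengths over a 3-symbol vocabulary.
a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

max_len = max(len(a), len(b))
batch = [pad_matrix(m, max_len, 3) for m in (a, b)]

# Every matrix now has the same number of rows, so the batch is rectangular.
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (2, 3, 3)
```

Once the nested list is rectangular, passing it to `numpy.array` gives a regular 3-D array rather than an array of mismatched rows.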
