Prediction always the same while using Sci-kit Learn SVM

I have a dataset where I'm trying to predict what kind of DNA a data entry is from its DNA makeup. For example, the string ATTAG...ACGAT might translate to EI. The possible outputs are either EI, IE, or N. The dataset can be investigated further here. I tried switching out kernels from linear to rbf, but the results are the same. The SVM classifier seems to output N every time. Any ideas why? I'm a beginner to Sci-kit Learn.

import pandas as pd
# 3190 total
training_data = pd.read_csv('new_training.csv')
test_data = pd.read_csv('new_test.csv')
frames = [training_data, test_data]
data = pd.concat(frames)
x = data.iloc[:, 0:60]  # the 60 base columns
y = data.iloc[:, 60]    # the class label (EI, IE or N)

x = pd.get_dummies(x)
train_x = x.iloc[0:3000, :]
train_y = y.iloc[0:3000]
test_x = x.iloc[3000:3190, :]
test_y = y.iloc[3000:3190]

from sklearn import svm
from sklearn import preprocessing

clf = svm.SVC(kernel="rbf")
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(y)

print(label_encoder.transform(train_y))
clf.fit(train_x, label_encoder.transform(train_y))

for u in train_y.unique():
    print(u)

predictions = clf.predict(test_x)

correct = 0
total = len(predictions)
for i in range(total):
    # inverse_transform expects an array-like, so wrap and unwrap the single label
    prediction = label_encoder.inverse_transform([predictions[i]])[0]
    print('predicted %s and actual %s' % (prediction, test_y.iloc[i]))
    if prediction == test_y.iloc[i]:
        correct += 1

print('correct %d out of %d' % (correct, total))

First I import the training and test data, combine it, and split it into either x (inputs) or y (output label). Then I convert x into the dummy-variable version, from the original 60 columns to roughly 300 columns, since each DNA spot can be A, T, G, C, and sometimes N. Basically, each input gets either a 0 or a 1 for every possible value it can take. (Is there a better way to do this? Sci-kit learn doesn't support categorical encoding, and I tried my best from this.) Then I separate the data again (I had to merge so that I could generate dummies on the whole data space).
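
As an aside, here is a minimal sketch (toy columns, not the real dataset) of what pd.get_dummies does to a couple of base positions, and why the frames have to be merged first: each column only gets indicator columns for the values it actually contains, so encoding train and test separately can produce mismatched columns.

import pandas as pd

# Toy frame with two base positions (hypothetical data)
df = pd.DataFrame({'pos0': ['A', 'T', 'G'],
                   'pos1': ['C', 'C', 'N']})
print(pd.get_dummies(df))
#    pos0_A  pos0_G  pos0_T  pos1_C  pos1_N
# 0       1       0       0       1       0
# 1       0       0       1       1       0
# 2       0       1       0       0       1
# (newer pandas versions print True/False instead of 0/1)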

From here, I just run the svm stuff to fit on the x and y labels and then to predict on test_x. I also had to encode/label y from the string version to the numerical version. But yeah, it always produces N, which I feel is wrong. How do I fix it? Thank you!
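
For reference, the encode/decode round trip with LabelEncoder looks like this (a toy sketch, independent of the dataset):

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['EI', 'IE', 'N'])              # classes_ sorted: EI=0, IE=1, N=2
codes = le.transform(['N', 'EI'])      # array([2, 0])
print(le.inverse_transform(codes))     # ['N' 'EI']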

I think the issue is the way the data is split into train and test. You have taken the first 3000 samples for training and the remaining 190 samples for testing. I found out that with such training the classifier yields the true class label for all the test samples (score = 1.0). I have also noticed that the last 190 samples of the dataset have the same class label, namely 'N'. Therefore the result you obtained is correct.
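
You can check this yourself with the y series from your own script (assuming the same variables as in your code); if the file is ordered by class as described, the tail of the dataset is all 'N':

# Inspect the class labels of the 190 samples you test on
print(y.iloc[3000:].value_counts())
# Expected output if the ordering is as described: N    190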

I would recommend you split the dataset into train and test through ShuffleSplit with test_size=.06 (this corresponds approximately to 190/3190, although to make visualization of the results easier I used test_size=.01 in the sample run below). For the sake of simplicity I would also suggest you use OneHotEncoder to encode the categorical values of the features.
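
If you prefer a one-liner, a sketch of an alternative is train_test_split (also in sklearn.model_selection since 0.18), applied to the X and y arrays built in the full code below; passing stratify=y additionally keeps the class proportions the same in both splits:

from sklearn.model_selection import train_test_split

# Shuffled, stratified split; test_size=.06 is roughly 190/3190
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=.06, random_state=0, stratify=y)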

Here's the full code (I have taken the liberty of performing some refactoring):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import ShuffleSplit
from sklearn import svm

data = np.loadtxt(r'splice.data', delimiter=',', dtype=str)

bases = {'A': 0, 'C': 1, 'D': 2, 'G': 3, 'N': 4, 'R': 5, 'S': 6, 'T': 7}

# encode each base character of the sequence column as an integer
X_base = np.asarray([[bases[c] for c in seq.strip()] for seq in data[:, 2]])
y_class = data[:, 0]

enc = OneHotEncoder(n_values=len(bases))
lb = LabelEncoder()

enc.fit(X_base)  
lb.fit(y_class)

X = enc.transform(X_base).toarray()
y = lb.transform(y_class)

rs = ShuffleSplit(n_splits=1, test_size=.01, random_state=0)
train_index, test_index = next(rs.split(X))
train_X, train_y = X[train_index], y[train_index]
test_X, test_y = X[test_index], y[test_index]

clf = svm.SVC(kernel="rbf")
clf.fit(train_X, train_y)

predictions = clf.predict(test_X)

Demo:

In [2]: lb.inverse_transform(predictions)
Out[2]: 
array(['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
       'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
       'EI', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'], 
      dtype='|S79')

In [3]: y_class[test_index]
Out[3]: 
array(['IE', 'EI', 'EI', 'EI', 'EI', 'IE', 'N', 'N', 'EI', 'N', 'N', 'IE',
       'IE', 'N', 'N', 'IE', 'EI', 'N', 'N', 'EI', 'IE', 'EI', 'IE', 'N',
       'IE', 'N', 'IE', 'N', 'EI', 'N', 'N', 'EI'], 
      dtype='|S79')

In [4]: clf.score(test_X, test_y)
Out[4]: 0.96875

Note: Make certain your sklearn version is 0.18.1, otherwise the code above might not work.
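
On a newer scikit-learn (0.22+), the n_values keyword has been removed from OneHotEncoder; a rough, untested sketch of the equivalent encoding step would be:

from sklearn.preprocessing import OneHotEncoder

# One list of the 8 possible integer codes per sequence position,
# so every position gets the same indicator columns
enc = OneHotEncoder(categories=[sorted(bases.values())] * X_base.shape[1])
X = enc.fit_transform(X_base).toarray()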
