
100% error rate on test set with one-class SVM

I am trying to detect outlier images, but I'm getting bizarre results from the model.

I've read in the images with cv2, flattened them into 1-D arrays, turned them into a pandas DataFrame, and then fed that into the SVM.

import numpy as np
import cv2
import glob
import pandas as pd
import sys, os
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import *
import seaborn as sns

Load the labels and files:

labels_wt = np.loadtxt("labels_wt.txt", delimiter="\t", dtype="str")
files_wt = np.loadtxt("files_wt.txt", delimiter="\t", dtype="str")

Load and flatten the images:

# wild-type (inlier) images
wt_images_tmp = [cv2.imread(file) for file in files_wt]
wt_images = [image.flatten() for image in wt_images_tmp]
tmp3 = np.array(wt_images)
# mutant (outlier) images; files_mut is assumed to be loaded the same way as files_wt above
mutant_images_tmp = [cv2.imread(file) for file in files_mut]
mutant_images = [image.flatten() for image in mutant_images_tmp]
tmp4 = np.array(mutant_images)


X = pd.DataFrame(tmp3) #load the wild-type images
y = pd.Series(labels_wt)
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42) 
X_outliers = pd.DataFrame(tmp4)
clf = svm.OneClassSVM(nu=0.15, kernel="rbf", gamma=0.0001)
clf.fit(X_train)

Then I evaluate the results following the sklearn tutorial on one-class SVM.

y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

print(n_error_train / len(y_pred_train))
print(float(n_error_test) / float(len(y_pred_test)))
print(n_error_outliers / len(y_pred_outliers))

My error rates on the training set have been variable (10-30%), but on the test set they have never dropped below 100%. Am I doing something wrong?

My guess is that setting random_state=42 is biasing your train_test_split to always produce the same splitting pattern. You can read more about it in this answer. Don't specify any random state and run the code again:

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2)

This will show different results. Once you are sure this works, make sure you then do cross-validation, possibly using k-fold cross-validation. Let us know if this helps.
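
If it helps, here is a minimal sketch of what that k-fold check could look like for a one-class SVM (assuming X is the wild-type DataFrame built above; the fold count and shuffling are arbitrary choices, not part of your original code):

from sklearn.model_selection import KFold
from sklearn import svm
import numpy as np

kf = KFold(n_splits=5, shuffle=True)
test_error_rates = []
for train_idx, test_idx in kf.split(X):
    # fit only on the training fold of inlier (wild-type) images
    clf = svm.OneClassSVM(nu=0.15, kernel="rbf", gamma=0.0001)
    clf.fit(X.iloc[train_idx])
    # predictions of -1 on held-out inliers count as errors
    y_pred = clf.predict(X.iloc[test_idx])
    test_error_rates.append(np.mean(y_pred == -1))

print(np.mean(test_error_rates))  # average held-out error rate across folds

Averaging the held-out error rate over several folds gives you an estimate that depends less on any single split than one train/test partition does.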
