Weird SVM behavior in R (e1071)
I ran the following code for a binary classification task with an SVM in both R (first example) and Python (second example). Given randomly generated data X and a response Y, this code performs leave-group-out cross-validation 1000 times. Each entry of the resulting vector is therefore the mean of that observation's predictions across CV iterations.

Computing the area under the ROC curve should give ~0.5, since X and Y are completely random. However, this is not what we see: the area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.

Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.
library(e1071)
library(pROC)

Y = as.factor(rep(c(1, 2), times = 14))
X = matrix(runif(length(Y) * 100), nrow = length(Y))
colnames(X) = 1:ncol(X)

iter = 1000
ansMat = matrix(NA, length(Y), iter)
for (i in seq(iter)) {
  # draw a random half of the rows as the training set
  train = sample(seq(length(Y)), 0.5 * length(Y))
  # skip splits where one class is absent from the training set
  if (min(table(Y[train])) == 0)
    next
  # the remaining rows form the test set
  test = seq(length(Y))[-train]
  # train model
  XX = X[train, ]
  YY = Y[train]
  mod = svm(XX, YY, probability = FALSE)
  # predict on the held-out rows
  XXX = X[test, ]
  predVec = predict(mod, XXX)
  RFans = attr(predVec, 'decision.values')  # unused in the rest of the script
  ansMat[test, i] = as.numeric(predVec)
}
# each row of ansMat holds one observation's predictions across iterations
ans = rowMeans(ansMat, na.rm = TRUE)
r = roc(Y, ans)$auc
print(r)
When I implement the same thing in Python, I get similar results.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2] * 14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5 * len(Y)), replace=False, p=None)
    # skip splits where the training set contains only one class
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
You should consider each iteration of cross-validation an independent experiment, in which you train on the training set, test on the testing set, and then calculate the model skill score (in this case, AUC). Averaging per-observation predictions across iterations instead couples the iterations together, which is why the score drifts away from 0.5. So what you should actually do is calculate the AUC for each CV iteration, and then take the mean of those AUCs.
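A minimal sketch of this in Python (assuming scikit-learn; the per-iteration AUC is computed from the SVM's `decision_function` scores rather than hard class labels, and splits where either side contains only one class are skipped):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
Y = np.array([1, 2] * 14)
X = rng.uniform(size=(len(Y), 100))

n_iter = 1000
aucs = []
for _ in range(n_iter):
    # random half of the rows as the training set, the rest as test
    train = rng.choice(len(Y), size=len(Y) // 2, replace=False)
    test = np.setdiff1d(np.arange(len(Y)), train)
    # skip splits where either set contains only one class
    if len(np.unique(Y[train])) < 2 or len(np.unique(Y[test])) < 2:
        continue
    mod = SVC(probability=False)
    mod.fit(X[train], Y[train])
    # score this iteration independently using the decision values
    scores = mod.decision_function(X[test])
    aucs.append(roc_auc_score(Y[test], scores))

print(np.mean(aucs))
```

With completely random X and Y, the mean of the per-iteration AUCs stays close to 0.5, as expected.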