
What is the "random" or non-deterministic factor in SVM probability predictions from e1071 in R?

I'm new to SVM and e1071. I found that the results are different every time I run the exact same code.

For example:

data(iris)
library(e1071)

model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
pred <- predict(model, iris[150,-5], probability = TRUE)
result1 <- as.data.frame(attr(pred, "probabilities"))

model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
pred <- predict(model, iris[150,-5], probability = TRUE)
result2 <- as.data.frame(attr(pred, "probabilities"))

Then I got result1 as:

         setosa versicolor virginica
150 0.009704854  0.1903696 0.7999255

and result2 as:

        setosa versicolor virginica
150 0.01006306  0.1749947 0.8149423

and the result keeps changing every round.

Here I'm using the first 149 rows as the training set and the last row as the test case. The probabilities for each class in result1 and result2 are not exactly the same. I'm guessing that some step in the prediction is "random". How does this happen?

I'm aware that the predicted probabilities can be held fixed if I call set.seed() with the same number before each call. I'm not "aiming" for a fixed prediction result; I'm just curious why this happens and what steps are taken to generate the predicted probabilities.

The slight difference doesn't have much impact on the iris data, since the last sample would still be predicted as "virginica". But when my data (with two classes, A and B) is not that "good", and an unknown sample is given a probability of being class A of 0.489 in one run and 0.521 in another, the result is confusing.

Thanks!

SVM uses a cross-validation step when developing the probability estimates, and the cases are randomly shuffled before they are split into folds. The source code for that step starts with:

// Cross-validation decision values for probability estimates
static void svm_binary_svc_probability(
    const svm_problem *prob, const svm_parameter *param,
    double Cp, double Cn, double& probA, double& probB)
{
    int i;
    int nr_fold = 5;
    int *perm = Malloc(int,prob->l);
    double *dec_values = Malloc(double,prob->l);

    // random shuffle
    GetRNGstate();
    for(i=0;i<prob->l;i++) perm[i]=i;
    for(i=0;i<prob->l;i++)
    {
        int j = i+((int) (unif_rand() * (prob->l-i))) % (prob->l-i);
        swap(perm[i],perm[j]);
    }
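To see why that shuffle ties the fitted model to the state of the random number generator, here is a minimal standalone sketch of the same loop. It is an illustration, not the e1071 source: std::mt19937 stands in for R's unif_rand() so that it compiles outside R, and shuffled_indices and fold_of_case are hypothetical helper names. The fold boundaries follow the begin/end arithmetic that, as far as I can tell, libsvm uses later in the same function; that part is my reading of the source and is not shown in the excerpt above.

```cpp
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: mirrors the "random shuffle" loop above, but draws
// from std::mt19937 instead of R's unif_rand() (an assumption, made so the
// sketch is usable outside of R).
std::vector<int> shuffled_indices(int n, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    std::vector<int> perm(n);
    for (int i = 0; i < n; i++) perm[i] = i;
    for (int i = 0; i < n; i++)
    {
        // pick a position in [i, n) and swap it into slot i
        int j = i + static_cast<int>(unif(rng) * (n - i)) % (n - i);
        std::swap(perm[i], perm[j]);
    }
    return perm;
}

// Hypothetical helper: fold membership implied by a permutation.
// Position k of perm is held out in fold f when
// f*n/nr_fold <= k < (f+1)*n/nr_fold (assumed to match libsvm's split).
std::vector<int> fold_of_case(const std::vector<int>& perm, int nr_fold)
{
    const int n = static_cast<int>(perm.size());
    std::vector<int> fold(n);
    for (int f = 0; f < nr_fold; f++)
    {
        int begin = f * n / nr_fold;
        int end = (f + 1) * n / nr_fold;
        for (int k = begin; k < end; k++) fold[perm[k]] = f;
    }
    return fold;
}
```

Calling shuffled_indices(149, 123) twice returns the same order, so fold_of_case puts every case in the same fold both times. With a different seed, the order and therefore the held-out groups will almost certainly differ, and so will the cross-validated decision values that the probability estimates are built from.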

You can create "predictability" by setting the random seed just before each call to svm():

> data(iris)
> library(e1071)
> set.seed(123)
> model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
> pred <- predict(model, iris[150,-5], probability = TRUE)
> result1 <- as.data.frame(attr(pred, "probabilities"))
> set.seed(123)
> model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
> pred <- predict(model, iris[150,-5], probability = TRUE)
> result2 <- as.data.frame(attr(pred, "probabilities"))
> result1
         setosa versicolor virginica
150 0.009114718  0.1734126 0.8174727
> result2
         setosa versicolor virginica
150 0.009114718  0.1734126 0.8174727
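As for why a different shuffle nudges the final numbers: as I understand libsvm, the cross-validated decision values are used to fit a two-parameter sigmoid (the probA and probB arguments in the signature quoted earlier), and each pairwise probability comes from passing a decision value through that sigmoid. The sketch below follows my reading of libsvm's sigmoid_predict and is not copied from e1071. The numbers in the note after it are made up for illustration.

```cpp
#include <cmath>

// Platt-style sigmoid: P = 1 / (1 + exp(A*f + B)), where f is the SVM
// decision value and (A, B) are the parameters fitted during the
// cross-validation step. Written in a numerically stable form.
double sigmoid_predict(double decision_value, double A, double B)
{
    double fApB = decision_value * A + B;
    if (fApB >= 0)
        return std::exp(-fApB) / (1.0 + std::exp(-fApB));
    else
        return 1.0 / (1.0 + std::exp(fApB));
}
```

For a sample close to the boundary, say a decision value of 0.02, sigmoid_predict(0.02, -2.0, 0.0) is about 0.51 while sigmoid_predict(0.02, -2.0, 0.08) is about 0.49. The decision value is the same in both calls; only the fitted (A, B) moved slightly, which is the kind of shift a different fold assignment can cause. With more than two classes, libsvm then combines the pairwise estimates, a step not shown here.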

But I am reminded of the epigram from Emerson: "A foolish consistency is the hobgoblin of little minds."
