
Scikit-learn SVM only one class exception

I'm trying to ensemble SVMs with Scikit-learn and, in particular, to optimize their hyperparameters. Seemingly at random, I get the following error:

  File "C:\Users\jakub\anaconda3\envs\SVM_ensembles\lib\site-packages\sklearn\svm\_base.py", line 250, in _dense_fit
    self.probB_, self.fit_status_ = libsvm.fit(
  File "sklearn\svm\_libsvm.pyx", line 191, in sklearn.svm._libsvm.fit
ValueError: Invalid input - all samples with positive weights have the same label.

As far as I understand, the message comes from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/src/libsvm/svm.cpp and means that the SVM received examples from only one class. I'm using stratified K-fold cross-validation and have a fairly balanced dataset (45% one class, 55% the other), so this should not happen.

What can I do?

The optimization code that throws the error:

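from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from skopt import BayesSearchCV
from skopt.space import Categorical, Real

# SVMEnsemble (shown further down) and the load_* dataset helpers are my own code.


def get_best_ensemble_params(X_train, y_train, X_test, y_test, n_tries=5):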
    search_spaces = {
        "max_samples": Real(0.1, 1, "uniform"),
        "max_features": Real(0.1, 1, "uniform"),

        "kernel": Categorical(["linear", "poly", "rbf", "sigmoid"]),
        "C": Real(1e-6, 1e+6, "log-uniform"),
        "gamma": Real(1e-6, 1e+1, "log-uniform")
    }

    best_accuracy = 0
    best_model = None
    for i in range(n_tries):
        done = False
        while not done:
            try:
                optimizer = BayesSearchCV(SVMEnsemble(), search_spaces, cv=3, n_iter=10, n_jobs=-1, n_points=10,
                                          verbose=1)
                optimizer.fit(X_train, y_train)  # <- ERROR HERE
                accuracy = accuracy_score(y_test, optimizer.predict(X_test))
                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    best_model = optimizer
                done = True
                print(i, "job done")
            except:
                pass

    return best_model.best_params_


if __name__ == "__main__":
    dataset_name = "acute_inflammations"

    loading_functions = {
        "acute_inflammations": load_acute_inflammations,
        "breast_cancer_coimbra": load_breast_cancer_coimbra,
        "breast_cancer_wisconsin": load_breast_cancer_wisconsin
    }

    X, y = loading_functions[dataset_name]()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)  # reuse the statistics fitted on the training set

    params = get_best_ensemble_params(X_train, y_train, X_test, y_test)
    params["n_jobs"] = -1
    params["random_state"] = 0
    model = SVMEnsemble(n_estimators=20, **params)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Accuracy:", accuracy_score(y_test, y_pred))

My custom SVMEnsemble is just a BaggingClassifier with a hard-coded SVC:

import inspect
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from skopt import BayesSearchCV


svm_possible_args = {"C", "kernel", "degree", "gamma", "coef0", "shrinking", "probability", "tol", "cache_size",
                     "class_weight", "max_iter", "decision_function_shape", "break_ties"}

bagging_possible_args = {"n_estimators", "max_samples", "max_features", "bootstrap", "bootstrap_features",
                         "oob_score", "warm_start", "n_jobs"}

common_possible_args = {"random_state", "verbose"}


class SVMEnsemble(BaggingClassifier):
    def __init__(self, voting_method="hard", n_jobs=-1,
                 n_estimators=10, max_samples=1.0, max_features=1.0,
                 C=1.0, kernel="linear", gamma="scale",
                 **kwargs):
        if voting_method not in {"hard", "soft"}:
            raise ValueError(f"voting_method {voting_method} is not recognized.")

        self._voting_method = voting_method
        self._C = C
        self._gamma = gamma
        self._kernel = kernel

        passed_args = {
            "n_jobs": n_jobs,
            "n_estimators": n_estimators,
            "max_samples": max_samples,
            "max_features": max_features,
            "C": C,
            "gamma": gamma,
            "cache_size": 1024,
        }


        kwargs.update(passed_args)

        svm_args = {
            "probability": True if voting_method == "soft" else False,
            "kernel": kernel
        }

        bagging_args = dict()

        for arg_name, arg_val in kwargs.items():
            if arg_name in svm_possible_args:
                svm_args[arg_name] = arg_val
            elif arg_name in bagging_possible_args:
                bagging_args[arg_name] = arg_val
            elif arg_name in common_possible_args:
                svm_args[arg_name] = arg_val
                bagging_args[arg_name] = arg_val
            else:
                raise ValueError(f"argument {arg_name} is not recognized.")

        self.svm_args = svm_args
        self.bagging_args = bagging_args

        base_estimator = SVC(**svm_args)
        super().__init__(base_estimator=base_estimator, **bagging_args)

    @property
    def voting_method(self):
        return self._voting_method

    @voting_method.setter
    def voting_method(self, new_voting_method):
        if new_voting_method == "soft":
            self._voting_method = new_voting_method
            self.svm_args["probability"] = True
            base_estimator = SVC(**self.svm_args)
            super().__init__(base_estimator=base_estimator, **self.bagging_args)
        elif self._voting_method == "soft":
            self._voting_method = new_voting_method
            self.svm_args["probability"] = False
            base_estimator = SVC(**self.svm_args)
            super().__init__(base_estimator=base_estimator, **self.bagging_args)
        else:
            self._voting_method = new_voting_method

    @property
    def C(self):
        return self._C

    @C.setter
    def C(self, new_C):
        self._C = new_C
        self.svm_args["C"] = new_C
        base_estimator = SVC(**self.svm_args)
        super().__init__(base_estimator=base_estimator, **self.bagging_args)

    @property
    def gamma(self):
        return self._gamma

    @gamma.setter
    def gamma(self, new_gamma):
        self._gamma = new_gamma
        self.svm_args["gamma"] = new_gamma
        base_estimator = SVC(**self.svm_args)
        super().__init__(base_estimator=base_estimator, **self.bagging_args)

    @property
    def kernel(self):
        return self._kernel

    @kernel.setter
    def kernel(self, new_kernel):
        self._kernel = new_kernel
        self.svm_args["kernel"] = new_kernel
        base_estimator = SVC(**self.svm_args)
        super().__init__(base_estimator=base_estimator, **self.bagging_args)

    def predict(self, X):
        if self._voting_method == "hard":
            return super().predict(X)
        elif self._voting_method == "soft":
            probabilities = np.zeros((X.shape[0], self.classes_.shape[0]))
            for estimator in self.estimators_:
                estimator_probabilities = estimator.predict_proba(X)
                probabilities += estimator_probabilities
            return self.classes_[probabilities.argmax(axis=1)]
        else:
            raise ValueError(f"voting_method {self._voting_method} is not recognized.")

From the way you describe the problem (you hit it "quite randomly"), together with your data and code, I'm almost certain that the bagging classifier occasionally draws a random subsample of training examples that contains only one class. Stratified K-fold won't help here: it only controls how the data is split into training and validation folds, not how BaggingClassifier picks its random subsample of max_samples examples from the training fold. If you look at how BaggingClassifier draws a subsample, you'll see that nothing guards against this.

An easy way to confirm this is to replace "max_samples": Real(0.1, 1, "uniform") with a much smaller range, e.g. "max_samples": Real(0.02, 0.03, "uniform") (or a small fixed value), and check that the error becomes much more frequent.
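The same effect can be previewed without running the search at all. The sketch below is a standard-library simulation, not scikit-learn's own sampling code: it assumes the draw size is the max_samples fraction truncated to an integer, that sampling is with replacement, and it uses an illustrative 53 + 43 class split. The function name single_class_rate is made up for this example.

```python
import random


def single_class_rate(n_major, n_minor, max_samples, n_trials=20_000, seed=0):
    """Estimate how often one bootstrap draw contains only one class.

    max_samples is a fraction of the training set, like the
    BaggingClassifier parameter of the same name.
    """
    rng = random.Random(seed)
    labels = [0] * n_major + [1] * n_minor
    # Assumption: the fraction is truncated to an integer, at least one sample.
    n_draw = max(1, int(max_samples * len(labels)))
    hits = 0
    for _ in range(n_trials):
        draw = rng.choices(labels, k=n_draw)  # sampling with replacement
        if len(set(draw)) == 1:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    # 53 + 43 training examples are illustrative numbers, not the asker's data.
    for fraction in (0.02, 0.05, 0.1, 0.3):
        rate = single_class_rate(53, 43, fraction)
        print(f"max_samples={fraction}: {rate:.2%} of draws are single-class")
```

The rate climbs steeply as the fraction shrinks, which is why narrowing the range makes the error much easier to reproduce.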

I'm not sure whether you really run this with n_tries=5 and n_iter=10 (which seems small for the number of hyperparameters you have), with larger values, or by rerunning the whole script with different random seeds. In any case, let's estimate the probability of hitting the problem with max_samples=0.1 on a dataset of 120 examples with a 55%/45% split. Say the training set gets 96 examples with roughly the same split, e.g. 53+43. Each time a bagged estimator is trained, it randomly picks about 10 of those 96 examples, with replacement, since bootstrap is enabled by default. The chance that all 10 come from the larger class is (53/96)^10, i.e. approximately 0.26% (the smaller class adds a little on top of that). If you train 50 such estimators in a row, the chance that at least one of them hits the issue is already about 12%. And if you keep running searches like this, you will almost inevitably run into it. (For simplicity I assumed max_samples=0.1 throughout, which is not accurate, but the search will land close to that value often enough.)
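The arithmetic is easy to check directly. This is a small sketch that assumes independent draws with replacement and the illustrative 53 + 43 split; the helper names are invented for the example.

```python
def p_all_from_one_class(n_class, n_total, n_draw):
    """Probability that n_draw samples drawn with replacement all come from one given class."""
    return (n_class / n_total) ** n_draw


def p_at_least_one_failure(p_single, n_fits):
    """Probability that at least one of n_fits independent draws is single-class."""
    return 1.0 - (1.0 - p_single) ** n_fits


p_major = p_all_from_one_class(53, 96, 10)
p_minor = p_all_from_one_class(43, 96, 10)

print(f"all 10 from the larger class:  {p_major:.4%}")
print(f"all 10 from the smaller class: {p_minor:.4%}")
print(f"at least one failure in 50 fits (larger class only): "
      f"{p_at_least_one_failure(p_major, 50):.1%}")
print(f"at least one failure in 50 fits (either class):      "
      f"{p_at_least_one_failure(p_major + p_minor, 50):.1%}")
```

With these numbers the 50-fit figure comes out a little over 12%, and counting the smaller class as well pushes it slightly higher.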

The last question is what to do about it. There are a few options:

  1. Ignore it: it shows up randomly during the search, and every other attempt that did not hit it is fine. You can also catch the ValueError and, if the message is the SVM complaining that only one class is present for training, skip that search iteration.
  2. Increase the lower bound of max_samples in your search space, or make it depend on the number of examples.
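Both options can be sketched with a few standard-library helpers. This is only one possible shape: fit_or_skip, is_one_class_error and min_safe_draws are hypothetical names, the message check relies on the text in the traceback above (which may differ between scikit-learn versions), and the bound assumes independent bootstrap draws with replacement.

```python
ONE_CLASS_MESSAGE = "all samples with positive weights have the same label"


def is_one_class_error(exc):
    """True if exc looks like the libsvm single-class ValueError from the traceback."""
    return isinstance(exc, ValueError) and ONE_CLASS_MESSAGE in str(exc)


def fit_or_skip(fit):
    """Call fit(); return its result, or None if it failed with the one-class error.

    Any other exception is re-raised so unrelated bugs stay visible.
    """
    try:
        return fit()
    except ValueError as exc:
        if is_one_class_error(exc):
            return None
        raise


def min_safe_draws(n_major, n_minor, n_fits, risk=0.01):
    """Smallest number of bootstrap draws per estimator such that the chance of
    any single-class draw across n_fits independent fits stays at or below risk."""
    n_total = n_major + n_minor
    p, q = n_major / n_total, n_minor / n_total
    for n_draw in range(1, n_total + 1):
        p_single = p ** n_draw + q ** n_draw
        if 1.0 - (1.0 - p_single) ** n_fits <= risk:
            return n_draw
    return n_total


if __name__ == "__main__":
    # Illustrative 53 + 43 split and 50 fits, not the asker's real numbers.
    print("draws needed per estimator:", min_safe_draws(53, 43, n_fits=50))
```

In the search loop, option 1 would look like fit_or_skip(lambda: optimizer.fit(X_train, y_train)), treating None as "skip this attempt". For option 2, divide the returned count by the size of the training fold (rounding up) to get a lower bound for the max_samples range.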

There are other possibilities too. For example, after the train/test split you can artificially inflate the training data by replacing each sample with N identical copies (N being e.g. 2 or 10), which reduces the chance that the bagging classifier draws a subsample containing only one class.
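A minimal sketch of that replication idea follows, using plain Python lists so it runs without dependencies (with NumPy arrays, np.repeat(X, N, axis=0) would do the same job). The helper names and the toy 53 + 43 data are assumptions for illustration, and the probability assumes draws with replacement and a draw size equal to the truncated max_samples fraction.

```python
def replicate(X, y, n_copies):
    """Repeat every (sample, label) pair n_copies times, keeping pairs aligned."""
    X_rep = [row for row in X for _ in range(n_copies)]
    y_rep = [label for label in y for _ in range(n_copies)]
    return X_rep, y_rep


def p_single_class(y, max_samples):
    """Chance that one bootstrap draw of int(max_samples * len(y)) labels is single-class."""
    n_draw = max(1, int(max_samples * len(y)))
    return sum((y.count(c) / len(y)) ** n_draw for c in set(y))


if __name__ == "__main__":
    # Toy data: 53 + 43 examples, illustrative only.
    X = [[float(i)] for i in range(96)]
    y = [0] * 53 + [1] * 43
    for n_copies in (1, 2, 10):
        _, y_rep = replicate(X, y, n_copies)
        print(f"N={n_copies}: P(single-class draw) = {p_single_class(y_rep, 0.1):.3g}")
```

The class proportions stay the same, but the same max_samples fraction now corresponds to N times as many draws, so a single-class draw becomes far less likely. The price is a larger training set for every SVM fit.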
