
scikit-learn learning_curve function throws a ValueError when fed an SVM Classifier

import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

car_data = pd.read_csv('car.csv')
car_data['car_rating'] = car_data.car_rating.apply(lambda x: 'acc' if x != 'unacc' else 'unacc')
car_data = pd.get_dummies(car_data, columns=['buying_price', 'maintenance', 'num_doors', 'persons', 'luggage_boot', 'safety'])

y = car_data.car_rating
X = car_data.drop(['car_rating'], axis=1)

clf = SVC(kernel='poly', degree=3, C=1000)
plot_learning_curve(estimator=clf, title="Test", X=X, y=y, cv=10)

Which returns the error:

ValueError: The number of classes has to be greater than one; got 1

This makes no sense to me, because the car_rating column definitely has two classes. Doing a value count returns:

unacc    1210
acc       518

So there are two classes. One class is smaller than the other, but it is still large enough that stratified k-fold should be able to preserve both classes in every split. So what is causing the error?
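
As a sanity check, an ordinary stratified cross-validation on the same data fits without complaint, which suggests the splitting itself is fine. A minimal check, assuming the clf, X, and y defined above:

from sklearn.model_selection import cross_val_score

# Plain 10-fold stratified CV trains on full folds and raises no error,
# so whatever goes wrong must be specific to learning_curve
print(cross_val_score(clf, X, y, cv=10).mean())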

The dataset I'm using can be found here. I did change the column names and collapsed the 'good' and 'vgood' classes into 'acc', but otherwise the data is unchanged.

Edit: here is the code for plot_learning_curve:

import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 10)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    taken from: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

And here is the full stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-04113e3ff056> in <module>()
      1 # the built in learning curve
      2 clf = SVC(kernel='poly', degree=3, C=1000)
----> 3 plot_learning_curve(estimator=clf, title="Test", X=X, y=y, cv=10)

<ipython-input-9-022f43e40037> in plot_learning_curve(estimator, title, X, y, ylim, cv, n_jobs, train_sizes)
     50     plt.ylabel("Score")
     51     train_sizes, train_scores, test_scores = learning_curve(
---> 52         estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
     53     train_scores_mean = np.mean(train_scores, axis=1)
     54     train_scores_std = np.std(train_scores, axis=1)

~/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in learning_curve(estimator, X, y, groups, train_sizes, cv, scoring, exploit_incremental_learning, n_jobs, pre_dispatch, verbose, shuffle, random_state)
   1126             clone(estimator), X, y, scorer, train, test,
   1127             verbose, parameters=None, fit_params=None, return_train_score=True)
-> 1128             for train, test in train_test_proportions)
   1129         out = np.array(out)
   1130         n_cv_folds = out.shape[0] // n_unique_ticks

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627 

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590 

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333 
    334     def get(self):

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
    459 
    460     except Exception as e:

~/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py in fit(self, X, y, sample_weight)
    148 
    149         X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
--> 150         y = self._validate_targets(y)
    151 
    152         sample_weight = np.asarray([]

~/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py in _validate_targets(self, y)
    504             raise ValueError(
    505                 "The number of classes has to be greater than one; got %d"
--> 506                 % len(cls))
    507 
    508         self.classes_ = cls

ValueError: The number of classes has to be greater than one; got 1

Yes, the problem is caused by train_sizes.

Its initial value is:

train_sizes=np.linspace(.1, 1.0, 10)

This is used to compute train_sizes_abs, which simply translates the training-set fractions into actual sample counts:

...
n_max_training_samples = len(cv_iter[0][0])
train_sizes_abs = _translate_train_sizes(train_sizes, n_max_training_samples)
...
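
To see what this translation produces, here is a minimal sketch (the fold size of 1555 is an assumption based on 10-fold CV over the car dataset's 1728 rows, and the plain scaling only approximates sklearn's private _translate_train_sizes helper):

import numpy as np

# With 10-fold CV on 1728 samples, each training fold holds ~1555 samples
train_sizes = np.linspace(.1, 1.0, 10)
n_max_training_samples = 1555  # assumed fold size, for illustration only
train_sizes_abs = (train_sizes * n_max_training_samples).astype(int)
print(train_sizes_abs)  # roughly [155 311 466 ... 1555]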

This is then actually used to select incremental amounts of training data for each fold:

...
else:
    train_test_proportions = []
    for train, test in cv_iter:
        for n_train_samples in train_sizes_abs:
            train_test_proportions.append((train[:n_train_samples], test))
...

And this leads to the problem: when the data for the first training tick is selected (the first value of train_test_proportions), by chance it contains only a single class. We cannot do anything about that selection itself.
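
Here is a minimal sketch of that failure mode (the label layout is an assumption that mimics a dataset sorted by class, which is how the car data is ordered):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels sorted by class, mimicking the ordered car dataset
y = np.array(['unacc'] * 1210 + ['acc'] * 518)
X = np.zeros((len(y), 1))

# StratifiedKFold (the default cv for a classifier) yields sorted indices
train, test = next(StratifiedKFold(n_splits=10).split(X, y))

# The smallest tick of np.linspace(.1, 1.0, 10) keeps only ~10% of the fold
n_train_samples = len(train) // 10
print(np.unique(y[train[:n_train_samples]]))  # ['unacc'] -- a single class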

However, if we shuffle the training data before this selection happens, the problem does not occur (there is still a minuscule chance that the selected data contains a single class even after shuffling, but such cases are very rare).

So we need to add the shuffle parameter to the learning_curve call:

train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv,
                                                        n_jobs=n_jobs, 
                                                        train_sizes=train_sizes, 
                                                        shuffle=True) 

After that, the code runs successfully.
