
Random Forest is overfitting

I'm using scikit-learn with stratified CV to compare some classifiers. I'm computing accuracy, recall, and AUC.

For parameter optimization I used GridSearchCV with 5-fold CV.

The best_params_ it found correspond to:

RandomForestClassifier(warm_start=True, min_samples_leaf=1, n_estimators=800, min_samples_split=5, max_features='log2', max_depth=400, class_weight=None)
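
The grid search was set up roughly like the sketch below (the parameter grid values here are placeholders for illustration, not my real grid):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid; the actual search space is not shown here
param_grid = {
    'n_estimators': [200, 400, 800],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [100, 400, None],
    'min_samples_split': [2, 5],
}

gs = GridSearchCV(estimator=RandomForestClassifier(random_state=1),
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5)
# gs.fit(x, y)
# print(gs.best_params_)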

My problem is that I think I'm really overfitting. For example:

Random Forest with standard deviation (+/-)

  • precision: 0.99 (+/- 0.06)
  • sensitivity: 0.94 (+/- 0.06)
  • specificity: 0.94 (+/- 0.06)
  • B_accuracy: 0.94 (+/- 0.06)
  • AUC: 0.94 (+/- 0.11)

Logistic Regression with standard deviation (+/-)

  • precision: 0.88 (+/- 0.06)
  • sensitivity: 0.79 (+/- 0.06)
  • specificity: 0.68 (+/- 0.06)
  • B_accuracy: 0.73 (+/- 0.06)
  • AUC: 0.73 (+/- 0.041)

The other classifiers look similar to logistic regression (so they don't appear to be overfitting).

My code for CV is:

import math
import numpy as np

# `data`, `titles`, `classifiers` and `skf` (the stratified folds) are defined earlier
X, y = [], []
for row in data:
    X.append(row[0])
    y.append(float(row[1]))
x = np.array(X)
y = np.array(y)

def SD(values):
    # return the (population) standard deviation and the mean of `values`
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / len(values)), mean

# go through all classifiers, compute 10 folds each
for name, clf in zip(titles, classifiers):
    pre, sen, spe, ba, area = [], [], [], [], []
    for train_index, test_index in skf:
        # convert the fold indices to lists and gather the corresponding rows
        train = train_index.tolist()
        test = test_index.tolist()
        X_train = [x[i] for i in train]
        X_test = [x[i] for i in test]
        y_train = [y[i] for i in train]
        y_test = [y[i] for i in test]

        #clf=clf.fit(X_train,y_train)
        #predicted=clf.predict_proba(X_test)
        #... other code, calculating metrics and so on...

    print(name)
    print("precision:   %0.2f \t(+/- %0.2f)" % (SD(pre)[1], SD(pre)[0]))
    print("sensitivity: %0.2f \t(+/- %0.2f)" % (SD(sen)[1], SD(sen)[0]))
    print("specificity: %0.2f \t(+/- %0.2f)" % (SD(spe)[1], SD(spe)[0]))
    print("B_accuracy:  %0.2f \t(+/- %0.2f)" % (SD(ba)[1], SD(ba)[0]))
    print("AUC:         %0.2f \t(+/- %0.2f)" % (SD(area)[1], SD(area)[0]))
    print("\n")

If I use scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy') instead, I don't get these "overfitting" values. So maybe there is something wrong with the CV method I'm using? But it only happens with RF...

I wrote my own CV loop because of the lack of a specificity scorer in cross_val_score.
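
For reference, a specificity scorer can be built with make_scorer so that cross_val_score can report it directly. The sketch below is only illustrative (self-contained demo data via make_classification, binary labels with 0 as the negative class), not the code I actually ran:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score

# specificity = recall of the negative class
specificity_scorer = make_scorer(recall_score, pos_label=0)

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=1)

scores = cross_val_score(clf, X_demo, y_demo, cv=10, scoring=specificity_scorer)
print("specificity: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))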

Herbert,

if your aim is to compare different learning algorithms, I recommend using nested cross-validation. (By "learning algorithm" I mean different algorithms such as logistic regression, decision trees, and other discriminative models that learn the hypothesis or model -- the final classifier -- from your training data.)

"Regular" cross-validation is fine if you want to tune the hyperparameters of a single algorithm. However, as soon as you run the hyperparameter optimization with the same cross-validation parameters/folds, your performance estimate will likely be over-optimistic. The reason is that if you run cross-validation over and over again, your test data becomes "training data" to some extent.

People ask me this question quite frequently, actually, and I will take some excerpts from a FAQ section I posted here: http://sebastianraschka.com/faq/docs/evaluate-a-model.html

In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training and test folds, and an inner loop is used to select the model via k-fold cross-validation on the training fold. After model selection, the test fold is then used to evaluate the model performance. After we have identified our "favorite" algorithm, we can follow up with a "regular" k-fold cross-validation approach (on the complete training set) to find its "optimal" hyperparameters and evaluate it on the independent test set. Let's consider a logistic regression model to make this clearer: using nested cross-validation, you will train m different logistic regression models, one for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation). If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then you proceed with the next algorithm, e.g., an SVM, etc.
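
Schematically, the procedure looks roughly like the sketch below; this is just an illustration of the loop structure that the one-line scikit-learn version in the Edit further down performs internally (StratifiedKFold, the fold counts, and accuracy scoring are assumptions of the sketch):

import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV

def nested_cv_score(estimator, param_grid, X, y, outer_splits=2, inner_splits=5):
    outer = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=1)
    outer_scores = []
    for train_idx, test_idx in outer.split(X, y):
        # inner loop: model selection on the training fold only
        gs = GridSearchCV(estimator, param_grid, scoring='accuracy', cv=inner_splits)
        gs.fit(X[train_idx], y[train_idx])
        # outer loop: evaluate the selected model on the untouched test fold
        outer_scores.append(gs.score(X[test_idx], y[test_idx]))
    return np.mean(outer_scores), np.std(outer_scores)

# usage: mean_acc, std_acc = nested_cv_score(some_estimator, some_param_grid, X, y)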

I can highly recommend this excellent paper, which discusses this issue in more detail:

PS: Typically, you don't need/want to tune the hyperparameters of a Random Forest (so extensively). The idea behind Random Forests (a form of bagging) is actually not to prune the decision trees -- in fact, one reason why Breiman came up with the Random Forest algorithm was to deal with the pruning issue/overfitting of individual decision trees. So the only parameter you really have to "worry" about is the number of trees (and maybe the number of random features per tree). Typically, you are best off taking bootstrap training samples of size n (where n is the number of samples in the original training set) and sqrt(m) features at each split (where m is the dimensionality, i.e., the number of features, of your training set).
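
For example, something along the following lines is usually sufficient (a sketch, not code from the discussion above; max_features='sqrt' corresponds to the sqrt(m) rule of thumb, and the number of trees is the main knob left to set):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500,      # number of trees
                            max_features='sqrt',   # sqrt(m) features per split
                            random_state=1)
# rf.fit(X_train, y_train)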

Hope that this was helpful!

Edit:

Some example code for doing nested CV via scikit-learn:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

# Nested cross-validation (here: 5 x 2 cross-validation)
# ======================================================
gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5)                                  # inner loop: model selection
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=2)  # outer loop: evaluation
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
