
Python - scikit-learn: how to specify a validation subset in decision and regression trees?

I am trying to build decision trees and regression trees with Python. I am using scikit-learn, but am open to alternatives.

What I can't tell from this library is whether a training and a validation subset can be provided, so that the library builds the model on the training subset, tests it on the validation subset, and stops splitting based on some rule (typically, when additional splits don't improve performance on the validation subset, which prevents overfitting). For example, this is what the JMP software does ( http://www.jmp.com/support/help/Validation_2.shtml#1016975 ).

I found no mention of how to use a validation subset on the official website ( http://scikit-learn.org/stable/modules/tree.html ), nor elsewhere on the internet.

Any help would be most welcome! Thanks!

There is a fairly rich set of cross-validation routines and examples in the cross-validation section of the scikit-learn user guide.

Note that a lot of progress seems to have been made in cross-validation between scikit-learn versions 0.14 and 0.15, so I recommend upgrading to 0.15 if you haven't already.

As jme notes in his comment, some of the cross validation capability has also been incorporated into the grid search and pipeline capabilities of SK-learn.
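To illustrate that grid-search integration: GridSearchCV can tune a tree's complexity (here, max_depth) by cross-validating each candidate internally, which is the closest built-in analogue to validation-based stopping. A minimal sketch (the import path is sklearn.model_selection in scikit-learn 0.18+; in the 0.15-era releases discussed here it was sklearn.grid_search):

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

# Each candidate max_depth is scored with 5-fold cross-validation, so the
# "validation subset" step is handled internally by GridSearchCV.
param_grid = {'max_depth': [1, 2, 3, 4, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)  # depth chosen by validation performance
print(search.best_score_)   # mean cross-validated accuracy at that depth
```

After fitting, search.best_estimator_ is a tree refit on the full data with the winning depth.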

For completeness in answering your question, here is a simple example; more advanced strategies (k-fold, leave-one-out, leave-p-out, shuffle-split, and so on) are also available:

import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape
# ((150, 4), (150,))

X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data,
                                                                     iris.target,
                                                                     test_size=0.4,
                                                                     random_state=0)

X_train.shape, y_train.shape
# ((90, 4), (90,))
X_test.shape, y_test.shape
# ((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
# 0.96...
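Since the question is specifically about trees, the same train/test split can emulate the JMP-style stopping rule: grow trees of increasing depth and keep the one that scores best on the held-out validation subset. A minimal sketch (again using sklearn.model_selection, the home of train_test_split since scikit-learn 0.18; in 0.15 it lives in sklearn.cross_validation):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_val, y_train, y_val = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Fit trees of increasing depth on the training subset and keep the depth
# that performs best on the validation subset (a manual pre-pruning rule).
best_depth, best_score = None, -1.0
for depth in range(1, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

print(best_depth, best_score)
```

This is not a built-in "stop splitting when validation performance plateaus" option (scikit-learn's trees don't accept a validation set directly), but selecting max_depth this way achieves the same overfitting protection.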

I hope this helps... Good luck!
