
Python - scikit-learn: how to specify a validation subset in decision and regression trees?

I am trying to build decision trees and regression trees with Python. I am using scikit-learn, but I am open to alternatives.

What I don't understand about this library is whether a training and a validation subset can be provided, so that the library builds the model on the training subset, tests it on the validation subset, and stops splitting based on some rule (typically when additional splits no longer improve performance on the validation subset, which prevents overfitting). For example, this is what the JMP software does ( http://www.jmp.com/support/help/Validation_2.shtml#1016975 ).
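Conceptually, what I am after is something like the sketch below. This is only an illustration of the idea using a manual train/validation split and a loop over tree depths; as far as I can tell it is not an actual scikit-learn option for validation-based stopping, and the data and variable names are made up for the example:

# Hypothetical sketch of the desired behaviour: grow trees of increasing depth on
# the training subset and stop once the validation subset stops improving.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + 0.1 * rng.randn(200)            # toy regression data

X_train, X_valid = X[:150], X[150:]           # manual training/validation split
y_train, y_valid = y[:150], y[150:]

best_error, best_depth = float('inf'), None
for depth in range(1, 15):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    error = mean_squared_error(y_valid, tree.predict(X_valid))
    if error < best_error:
        best_error, best_depth = error, depth
    else:
        break    # stop when an extra level no longer helps on the validation subset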

I found no mention of how to use a validation subset on the official website ( http://scikit-learn.org/stable/modules/tree.html ), nor anywhere else on the internet.

Any help would be most welcome! Thanks!

There is a fairly rich set of cross-validation routines and examples in the cross-validation section of the scikit-learn user guide.

Note that a lot of progress seems to have been made in cross-validation between scikit-learn versions 0.14 and 0.15, so I recommend upgrading to 0.15 if you haven't already.

As jme notes in his comment, some of the cross-validation capability has also been incorporated into the grid search and pipeline capabilities of scikit-learn.
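For instance, a minimal sketch of that combination, tuning a decision tree's depth by cross-validation (the estimator and parameter grid here are only illustrative; sklearn.grid_search is the module path in the 0.15-era releases mentioned above):

# Sketch: cross-validated tuning of a decision tree's max_depth with grid search.
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

iris = datasets.load_iris()

param_grid = {'max_depth': [2, 3, 4, 5, None]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)    # depth chosen by cross-validation
print(search.best_score_)     # mean cross-validated accuracy for that depth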

For completeness in answering your question, here is a simple example, but more advanced strategies such as k-fold, leave-one-out, leave-p-out and shuffle-split are all available:

import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

# Load the iris data: 150 samples, 4 features.
iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)    # (150, 4) (150,)

# Hold out 40% of the data as a test/validation subset.
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data,
                                                                     iris.target,
                                                                     test_size=0.4,
                                                                     random_state=0)

print(X_train.shape, y_train.shape)    # (90, 4) (90,)
print(X_test.shape, y_test.shape)      # (60, 4) (60,)

# Fit on the training subset, score on the held-out subset.
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))       # approximately 0.96
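And, for the k-fold variant mentioned above, a small self-contained sketch on the same iris data (cross_val_score clones the estimator, fits it on k-1 folds and scores it on the held-out fold):

from sklearn import cross_validation, datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # average cross-validated accuracy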

I hope this helps... Good luck!
