
Troubleshooting the random forest classifier in scikit-learn

I am trying to run the random forest classifier from scikit-learn and am getting suspiciously bad output: less than 1% of predictions are correct. The model is performing much worse than chance. I am relatively new to Python, ML, and scikit-learn (a triple whammy), and my concern is that I am missing something fundamental rather than needing to fine-tune the parameters. What I'm hoping for is more veteran eyes to look through the code and see if something is wrong with the setup.

I'm trying to predict classes for rows in a spreadsheet based on word occurrences, so the input for each row is an array representing how many times each word appears, e.g. [1 0 0 2 0 ... 1]. I am using scikit-learn's CountVectorizer to do this processing: I feed it strings containing the words in each row, and it outputs the word occurrence array(s). If this input isn't suitable for some reason, that is probably where things are going awry, but I haven't found anything online or in the documentation suggesting that's the case.
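For readers unfamiliar with the vectorizing step described above, here is a minimal sketch of what CountVectorizer produces. The three strings in `rows` are hypothetical stand-ins for the spreadsheet rows, which are not shown in the post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the strings built from each spreadsheet row.
rows = ["red apple red", "green apple", "green pear pear pear"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(rows)  # sparse matrix, one row per input string

# vocabulary_ maps each word to its column index; list the words in column order.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = X.toarray()  # dense array of per-row word counts

print(vocabulary)
print(counts)
```

With these inputs the columns are apple, green, pear, red, so the first row comes out as [1 0 0 2]: one "apple" and two "red".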

Right now, the forest is answering correctly about 0.5% of the time. Using the exact same inputs with an SGD classifier yields close to 80%, which suggests to me that the preprocessing and vectorizing I'm doing is fine and that the problem is something specific to the RF classifier. My first reaction was to look for overfitting, but even when I run the model on the training data, it still gets almost everything wrong.
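The training-set comparison described above can be sketched as follows. The `texts` and `labels` values are hypothetical toy data, since the real `train_file` and `train_targets` are not shown; the point is only that both classifiers are fitted and scored on exactly the same rows and labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data; the real train_file and train_targets are not shown.
texts = ["cheap pills online", "meeting at noon", "cheap watches online",
         "lunch meeting today", "buy pills cheap", "project meeting notes"] * 5
labels = ["spam", "ham", "spam", "ham", "spam", "ham"] * 5

# Dense counts, matching the input type used in the question.
X = CountVectorizer().fit_transform(texts).toarray()

models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("sgd", SGDClassifier(random_state=0)),
]

scores = {}
for name, model in models:
    model.fit(X, labels)
    # Score on the same rows the model was trained on.
    scores[name] = accuracy_score(labels, model.predict(X))

print(scores)
```

On data like this, a forest would normally reproduce nearly all of its own training labels. A training score close to zero would then be a hint to check how the rows and labels are lined up and how the predictions are being compared, though that is only one possible explanation.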

I've played around with the number of trees and the amount of training data, but that hasn't seemed to change much. I'm trying to show only the relevant code but can post more if that's helpful. This is my first SO post, so all thoughts and feedback are appreciated.

#pull in package to create word occurrence vectors for each line
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1,charset_error='ignore')
X_train = vectorizer.fit_transform(train_file)
#convert to dense array, the required input type for random forest classifier
X_train = X_train.todense()

#pull in random forest classifier and train on data
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100, compute_importances=True)
clf = clf.fit(X_train, train_targets)

#transform the test data into the vector format
testdata = vectorizer.transform(test_file)
testdata = testdata.todense()


#export the predictions, one per row
import csv
#Python 2 style: csv files are opened in binary mode
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile)
    for item in clf.predict(testdata):
        spamwriter.writerow([item])

If Random Forest (RF) does this badly even on the training set X_train, then something is definitely wrong, because on its own training data it should get a very high percentage, above 90%. Try the following (code snippet first):

# Imports for the estimators below. GMM and VBGMM come from older
# scikit-learn releases, which is what this snippet was written against.
from sklearn.cluster import KMeans
from sklearn.mixture import GMM, VBGMM
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Each block builds one alternative estimator; keep the one you want to try.
print("K-means")
clf = KMeans(n_clusters=len(train_targets), n_init=1000, n_jobs=2)

print("Gaussian Mixtures: full covariance")
covar_type = 'full'    # 'spherical', 'diag', 'tied', 'full'
clf = GMM(n_components=len(train_targets), covariance_type=covar_type, init_params='wc', n_iter=10000)

print("VBGMM: full covariance")
covar_type = 'full'    # 'spherical', 'diag', 'tied', 'full'
clf = VBGMM(n_components=len(train_targets), covariance_type=covar_type, alpha=1.0, random_state=None, thresh=0.01, verbose=False, min_covar=None, n_iter=1000000, params='wc', init_params='wc')

print("Random Forest")
clf = RandomForestClassifier(n_estimators=400, criterion='entropy', n_jobs=2)

print("MultiNomial Logistic Regression")
clf = LogisticRegression(penalty='l2', dual=False, C=1.0, fit_intercept=True, intercept_scaling=1, tol=0.0001)

# max_iter=-1 means no limit on the number of iterations
print("SVM: Gaussian Kernel, infty iterations")
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma=3.0, coef0=1.0, shrinking=True,
          probability=False, tol=0.001, cache_size=200, class_weight=None,
          verbose=False, max_iter=-1, random_state=None)
  1. Try different classifiers. The interface in scikit-learn is basically always the same, so you can swap them in and see how they behave (maybe RF is not really the best choice). See the code above.
  2. Try to create some randomly generated datasets to give to the RF classifier. I strongly suspect something goes wrong in the mapping process that generates the vectorizer objects. Therefore, start by creating your X_train yourself and see what happens.
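A minimal sketch of the second suggestion, assuming synthetic count-like features are an acceptable stand-in for the vectorizer output. The sizes, the Poisson noise, and the way the class signal is injected are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical synthetic data: 300 rows of word-count-like features, 3 classes.
n_rows, n_words, n_classes = 300, 20, 3
y = rng.randint(0, n_classes, size=n_rows)
X = rng.poisson(1.0, size=(n_rows, n_words))

# Make one "word" per class much more frequent so the classes are learnable.
X[np.arange(n_rows), y] += 5

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Accuracy on the rows the forest was trained on.
train_accuracy = clf.score(X, y)
print(train_accuracy)
```

If the forest scores well on the training rows here but not on the real X_train, that would suggest looking at how X_train and train_targets are built rather than at the classifier settings.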

Hope that helps.


 