Sequentially fitting Random Forest sklearn
I am training a Random Forest Classifier in Python using sklearn on a corpus of image data. Because I am performing image segmentation I have to store the data of every pixel, which ends up being a huge matrix, on the order of 100,000,000 data points, so when running an RF classifier on that matrix my computer hits a memory overflow error and takes forever to run.
One idea I had was to train the classifier on sequential small batches of the dataset, eventually training on the whole thing while improving the fit of the classifier each time. Is this an idea that could work? Will each call to fit simply override the previous fit?
You can use warm_start to keep the trees already fitted and add new ones on each call:
from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on X1, y1
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X1, y1)

# Build 100 additional trees on X2, y2
clf.set_params(n_estimators=200)
clf.fit(X2, y2)
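The same pattern extends to any number of batches: grow n_estimators before each fit and only the new trees are trained, each on the batch it is given. A minimal runnable sketch, using synthetic data and illustrative batch/tree counts (not from the original answer):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the pixel data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees_per_batch = 25
clf = RandomForestClassifier(n_estimators=0, warm_start=True)

# With warm_start=True, each fit() keeps the existing trees and
# trains only the newly requested ones on the current batch.
for X_batch, y_batch in zip(np.array_split(X, 4), np.array_split(y, 4)):
    clf.set_params(n_estimators=clf.n_estimators + trees_per_batch)
    clf.fit(X_batch, y_batch)

print(len(clf.estimators_))  # 100 trees after four batches of 25
```

Note that every tree still needs its whole batch in memory at once, so the batch size has to be chosen to fit your RAM.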
Alternatively, you can train several independent forests and merge their trees:
from functools import reduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    # Append b's fitted trees to a's and update the tree count
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# Create n random forest classifiers
rf_clf = [generate_rf(X_train, y_train, X_test, y_test) for i in range(n)]
# Combine them into a single classifier
rf_clf_combined = reduce(combine_rfs, rf_clf)
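Merging works because a forest's prediction is just the average over its estimators_ list. A small self-contained check of that property, using synthetic data and illustrative forest sizes (not from the original answer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Two small forests; in practice each could be trained on a different batch.
rf_a = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=5, random_state=1).fit(X, y)

# Merge b's trees into a, as combine_rfs does above
rf_a.estimators_ += rf_b.estimators_
rf_a.n_estimators = len(rf_a.estimators_)

# The merged forest's probabilities equal the mean over all 10 trees
per_tree = np.mean([t.predict_proba(X) for t in rf_a.estimators_], axis=0)
assert np.allclose(rf_a.predict_proba(X), per_tree)
```

One caveat with this approach: each forest only ever sees its own slice of the data, so unlike warm_start there is no single model whose trees collectively cover different batches unless you feed different batches to generate_rf yourself.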