
How to use warm_start

I'd like to use the warm_start parameter to add training data to my random forest classifier. I expected it to be used like this:

clf = RandomForestClassifier(...)
clf.fit(get_data())
clf.fit(get_more_data(), warm_start=True)

But the warm_start parameter is a constructor parameter. So do I do something like this?

clf = RandomForestClassifier()
clf.fit(get_data())
clf = RandomForestClassifier(warm_start=True)
clf.fit(get_more_data())

That makes no sense to me. Won't the new call to the constructor discard previous training data? I think I'm missing something.

The basic pattern (taken from Miriam's answer):

clf = RandomForestClassifier(warm_start=True)
clf.fit(get_data())
clf.fit(get_more_data())

would be the correct usage, API-wise.

But there is an issue here.

As the docs say:

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

This means that the only thing warm_start can do for you is add new DecisionTrees. All the previous trees are left untouched!
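A quick empirical check (a minimal sketch on synthetic data; the identity test assumes sklearn appends to, rather than rebuilds, the estimators_ list):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

clf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
clf.fit(X, y)
first_tree = clf.estimators_[0]

clf.n_estimators = 10  # request 5 additional trees
clf.fit(X, y)

print(len(clf.estimators_))              # 10
print(clf.estimators_[0] is first_tree)  # True: the old trees are untouched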

Let's check this in the source:

n_more_estimators = self.n_estimators - len(self.estimators_)

if n_more_estimators < 0:
    raise ValueError('n_estimators=%d must be larger or equal to '
                     'len(estimators_)=%d when warm_start==True'
                     % (self.n_estimators, len(self.estimators_)))

elif n_more_estimators == 0:
    warn("Warm-start fitting without increasing n_estimators does not "
         "fit new trees.")

This basically tells us that you would need to increase the number of estimators before attempting a new fit!

I have no idea what kind of usage sklearn expects here. I'm not sure whether fitting, increasing internal variables and fitting again is correct usage, but I somehow doubt it (especially as n_estimators is not a public class variable).

Your basic approach (in regards to this library and this classifier) is probably not a good idea for your out-of-core learning here! I would not pursue this further.

Just to add to @sascha's excellent answer, this hacky method works:

# assumes X_train, y_train are your training data
rf = RandomForestClassifier(n_estimators=1, warm_start=True)
rf.fit(X_train, y_train)
rf.n_estimators += 1      # request one more tree
rf.fit(X_train, y_train)  # the new tree is trained and appended; the old one is kept
A fuller demonstration on the iris data (shuffled first, so that every chunk contains samples from all three classes, which warm_start requires for classification):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

iris = load_iris()
X, y = shuffle(iris.data, iris.target, random_state=0)

### RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))

Below is the differentiation between warm_start and partial_fit.

When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit. Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.

partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and the model parameters stay fixed.

There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
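To make the contrast concrete, here is a minimal partial_fit sketch using SGDClassifier, one of the sklearn estimators that actually implements partial_fit (RandomForestClassifier does not):

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.utils import shuffle
import numpy as np

X, y = shuffle(*load_iris(return_X_y=True), random_state=0)

# partial_fit: the model parameters stay the same kind, the data arrives
# in mini-batches; all classes must be declared on the first call
sgd = SGDClassifier()
sgd.partial_fit(X[:75], y[:75], classes=np.unique(y))
sgd.partial_fit(X[75:], y[75:])  # continues from the current weights
print(sgd.score(X, y))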

Everything warm_start does boils down to preserving the state of the previous training run.


It differs from partial_fit in that the idea is not to incrementally learn on small batches of data, but rather to re-use a trained model in its previous state. Namely, the difference between a regular call to fit and a fit having set warm_start=True is that the estimator state is not cleared; see _clear_state:

if not self.warm_start:
    self._clear_state()

Which, among other parameters, would initialize all estimators:

if hasattr(self, 'estimators_'):
    self.estimators_ = np.empty((0, 0), dtype=np.object)

So having set warm_start=True, each subsequent call to fit will not initialize the trainable parameters; instead, it will start from their previous state and add new estimators to the model.
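(The _clear_state snippet quoted above comes from sklearn's gradient boosting code, where the same mechanism applies. A minimal sketch of the behaviour with that estimator, assuming standard GradientBoostingClassifier usage:)

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(random_state=0)

gb = GradientBoostingClassifier(n_estimators=50, warm_start=True)
gb.fit(X, y)
gb.set_params(n_estimators=100)  # state is kept; 50 more stages are added
gb.fit(X, y)
print(gb.estimators_.shape)  # (100, 1) for binary classification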


Which means that one could do:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

grid1 = {'bootstrap': [True, False],
         'max_depth': [10, 20, 30, 40, 50, 60],
         'max_features': ['auto', 'sqrt'],
         'min_samples_leaf': [1, 2, 4],
         'min_samples_split': [2, 5, 10]}

# X_train, y_train: your training data
rf_grid_search1 = GridSearchCV(estimator=RandomForestClassifier(),
                               param_grid=grid1,
                               cv=3)
rf_grid_search1.fit(X_train, y_train)

Then fit a model on the best parameters and set warm_start=True:

rf = RandomForestClassifier(**rf_grid_search1.best_params_, warm_start=True)
rf.fit(X_train, y_train)

Then we could perform a grid search only on n_estimators:

grid2 = {'n_estimators': [200, 400, 600, 800, 1000]}
rf_grid_search2 = GridSearchCV(estimator=rf,
                               param_grid=grid2,
                               cv=3)
rf_grid_search2.fit(X_train, y_train)
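After the second search, the winning tree count can be read back from the standard GridSearchCV attributes (a small usage sketch; the printed value is only illustrative):

print(rf_grid_search2.best_params_)  # e.g. {'n_estimators': 600}
print(rf_grid_search2.best_score_)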

The advantage here is that the estimators are already fit with the previous parameter setting, and with each subsequent call to fit, the model starts from the previous parameters; we are just analyzing whether adding new estimators would benefit the model.

As @sascha pointed out, the previously fitted trees are untouched, and you need to add new estimators before calling fit again. He seemed unsure how to change the count, since n_estimators is not meant to be modified directly; the API provides a function called set_params() which allows exactly this. Here's how I've done it in the past:

import random

# regressor is assumed to be a RandomForestRegressor created earlier,
# and INPUT / OUTPUT are the full feature and target lists
training_data = list(random.sample(list(zip(INPUT, OUTPUT)),
                                   min(int(len(INPUT) * 0.80), 1300)))
# get either 80% of the data or 1300 samples, whichever is smaller
__INPUT = []
__output = []
for _I, o in training_data:
    __INPUT.append(_I)
    __output.append(o)
# re-split our random sample of tuples into 2 lists
regressor.fit(__INPUT, __output)
# first fit
est = int(len(regressor.estimators_) *
          random.choice([1.1, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6,
                         1.1, 1.11, 1.13, 1.1, 1.11, 1.13]))
# grow the forest by a factor between 1.1 and 1.6... there's a better way to
# write this, but I'm putting the quick-and-dirty version here for the copy-paste people
print('Planting additional trees...', est - len(regressor.estimators_))
regressor = regressor.set_params(n_estimators=est, warm_start=True)
regressor.fit(__INPUT, __output)
# new trees fit
