Random Forest warm_start = True gives value error when running the scoring function - operands could not be broadcast together
I am implementing a random forest forecast as a baseline for my ML model. Since my X_train_split_xgb has shape (48195, 300), I need to do batch training (for memory reasons). To do that I set up the random forest with warm_start=True, but when I enable this I get an error on the rf.predict(X_train_split_xgb) line, namely: ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210). With warm_start=False I do not get this error and the code runs through. Does anybody know why I get this ValueError and how to fix it? I have already tried lots of things. I appreciate your help!
X_batch has shape (1000, 300)
y_batch has shape (1000,)
X_train_split_xgb has shape (48195, 300)
y_train_split_xgb_encoded has shape (48195,)
I do not even know how it tries to broadcast (48195,210), (48195,187), and (48195,210) together; where are 210 and 187 coming from?
from sklearn.ensemble import RandomForestClassifier

errors = []
rf = RandomForestClassifier(n_estimators=5,
                            random_state=0, warm_start=True)
for X_batch, y_batch in get_batches(X_train_split_xgb, y_train_split_xgb_encoded, 1000):
    # Run training and evaluate accuracy
    rf.fit(X_batch, y_batch)  # warm_start=True
    print(X_batch.shape)
    print(rf.predict(X_train_split_xgb))
    print(rf.score(X_train_split_xgb, y_train_split_xgb_encoded))
    # pred = rf.predict(X_batch)
    # errors.append(MSE(y_batch, rf.predict(X_batch)))
    rf.n_estimators += 1
Error:

ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210)
Expected: the code runs through and prints the scores at each iteration. Actual: the code stops in the second loop iteration, i.e. the second time the prediction/scoring needs to be done. It stops in rf.predict().
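The mismatched widths (210 vs. 187) are plausibly the number of distinct class labels seen by different fit calls: predict averages per-tree class-probability arrays, and a warm-started forest keeps its old trees, so trees fitted on batches containing different class sets emit arrays with different column counts. A minimal sketch reproducing the same failure on synthetic data (all names here are illustrative, not from the original code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 4, size=200)

rf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
rf.fit(X[y < 3], y[y < 3])   # first "batch" only contains classes {0, 1, 2}
rf.n_estimators += 5
rf.fit(X, y)                 # second "batch" contains all 4 classes

try:
    rf.predict(X)            # old trees emit 3 probability columns, new trees 4
except ValueError as e:
    print(e)                 # operands could not be broadcast together ...
```

Both fit calls succeed; the mismatch only surfaces when predict tries to sum the per-tree probability arrays, which matches the behaviour described above.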
Very late to answer, but leaving a response in case someone else runs into the same issue.

The reason this is happening is that the class labels are not stratified between different calls to the fit method: each batch contains a different subset of the classes, so trees added by later calls disagree with earlier trees about the number of classes. I performed a simple test where I fed the same X and y to the fit method in a loop, and that seems to work.
rf = RandomForestClassifier(warm_start=True)
for _ in range(10):
    X = df.head(100).drop(columns='class')
    y = df.head(100)['class'].values  # same rows as X
    rf.fit(X, y)
rf.score(X_test, y_test)
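A practical way to apply this to the batch training in the question is to make sure every batch contains every class, for example by drawing the batches with StratifiedKFold so the class proportions are preserved in each one. A sketch under that assumption (the data and names are synthetic stand-ins for X_train_split_xgb / y_train_split_xgb_encoded, not the asker's actual get_batches):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data with 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = rng.integers(0, 4, size=2000)

rf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, batch_idx in skf.split(X, y):    # each fold contains all classes
    rf.fit(X[batch_idx], y[batch_idx])  # adds trees up to the current n_estimators
    rf.n_estimators += 5                # grow the forest for the next batch
print(rf.predict(X).shape)              # (2000,) -- predict now works
```

Because every batch exposes the full label set, all trees agree on the number of classes and the probability arrays broadcast cleanly at predict time.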