Random Forest warm_start = True gives value error when running the scoring function - operands could not be broadcast together

I am implementing a random forest forecast as a baseline for my ML model. Since my X_train_split_xgb has shape (48195, 300), I need to train in batches (memory constraints). To do that I set up the random forest with warm_start=True, but when I enable it I get an error in the rf.predict(X_train_split_xgb) line, namely: ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210). With warm_start=False I do not get this error and the code runs through. Does anybody know why I get this ValueError and how to fix it? I have tried lots of things already. Appreciate your help!

X_batch has shape (1000, 300)

y_batch has shape (1000,)

X_train_split_xgb has shape (48195, 300)

y_train_split_xgb_encoded has shape (48195,)

I don't even know how it tries to broadcast (48195,210), (48195,187) and (48195,210) together. Where are 210 and 187 coming from?

from sklearn.ensemble import RandomForestClassifier

errors = []
rf = RandomForestClassifier(n_estimators=5,
                            random_state=0, warm_start=True)

for X_batch, y_batch in get_batches(X_train_split_xgb, y_train_split_xgb_encoded, 1000):
    # Run training and evaluate accuracy on the full training set
    rf.fit(X_batch, y_batch)  # warm_start=True: keeps the trees fitted so far
    print(X_batch.shape)
    print(rf.predict(X_train_split_xgb))
    print(rf.score(X_train_split_xgb, y_train_split_xgb_encoded))
    #pred = rf.predict(X_batch)
    #errors.append(MSE(y_batch, rf.predict(X_batch)))
    rf.n_estimators += 1  # grow the forest by one tree for the next batch
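For reference, the get_batches helper is not shown in the question. A minimal sketch of a generator that would yield batches with the shapes listed above (an assumption for reproducibility, not the asker's actual code) could look like:

def get_batches(X, y, batch_size):
    # Hypothetical helper (not from the question): yield consecutive
    # slices of X and y of size batch_size (the last one may be smaller).
    n_samples = X.shape[0]
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        yield X[start:end], y[start:end]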

Error:

 ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210)

Expected: the code runs through and prints the score at each iteration. Actual: the code stops in the second loop iteration, i.e. when the prediction/scoring has to be done for the second time; it stops in rf.predict().

Very late to answer, but leaving a response in case someone else runs into the same issue.

The reason this is happening is that the class labels are not consistent (stratified) between different calls to the fit method. Each batch contains a different subset of the class labels, so trees fitted on different batches produce predict_proba outputs with different numbers of columns (210 vs. 187 in your case), and those arrays cannot be added together at prediction time.
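A minimal sketch that reproduces the same broadcast error with synthetic data (a hypothetical example, not the asker's data; behaviour as observed on the scikit-learn versions I tried): the second batch introduces a class the first trees have never seen, so their probability outputs have fewer columns than the forest expects.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.rand(200, 5), rng.randint(0, 2, 200)   # batch 1: classes {0, 1}
X2, y2 = rng.rand(200, 5), rng.randint(0, 3, 200)   # batch 2: classes {0, 1, 2}

rf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
rf.fit(X1, y1)           # first 5 trees only know 2 classes
rf.n_estimators += 5
rf.fit(X2, y2)           # next 5 trees know 3 classes
rf.predict(X1)           # ValueError: operands could not be broadcast together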

I performed a simple test where I fed the same X and y to the fit method in a loop, and that seems to work.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(warm_start=True)

for _ in range(10):
    # Same 100 rows and labels on every iteration, so the class set never changes
    X = df.head(100).drop(columns='class')
    y = df.head(100)['class'].values
    rf.fit(X, y)
    rf.score(X_test, y_test)
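One possible workaround (a sketch under assumptions, not part of the original answer): draw the batches with StratifiedKFold so that every batch contains roughly the same class mix. This assumes X_train_split_xgb and y_train_split_xgb_encoded are NumPy arrays and that every class has at least as many samples as there are batches; classes rarer than that would still break the consistency.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def stratified_batches(X, y, n_batches, seed=0):
    # Each "test" fold of StratifiedKFold serves as one batch; folds keep the
    # class proportions, so every batch should see every class.
    skf = StratifiedKFold(n_splits=n_batches, shuffle=True, random_state=seed)
    for _, batch_idx in skf.split(X, y):
        yield X[batch_idx], y[batch_idx]

rf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
for X_batch, y_batch in stratified_batches(X_train_split_xgb, y_train_split_xgb_encoded, 48):
    rf.fit(X_batch, y_batch)
    rf.n_estimators += 1   # add a tree for the next batch, as in the question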

