Random Forest warm_start = True gives value error when running the scoring function - operands could not be broadcast together
I am implementing a random forest forecast as a baseline for my ML model. Since my X_train_split_xgb has shape (48195, 300), I need to do batch training (for memory reasons). To do that I set up the random forest with warm_start=True, but when I enable this I get an error on the rf.predict(X_train_split_xgb) line, namely: ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210). With warm_start=False I do not get this error and the code runs through. Does anybody know why I get this ValueError and how to fix it? I have already tried lots of things. I appreciate your help!
X_batch has shape (1000, 300)
y_batch has shape (1000,)
X_train_split_xgb has shape (48195, 300)
y_train_split_xgb_encoded has shape (48195,)
I do not even know how it tries to broadcast (48195,210), (48195,187), and (48195,210) together; where are 210 and 187 coming from?
from sklearn.ensemble import RandomForestClassifier

errors = []
rf = RandomForestClassifier(n_estimators=5,
                            random_state=0, warm_start=True)
for X_batch, y_batch in get_batches(X_train_split_xgb, y_train_split_xgb_encoded, 1000):
    # Run training and evaluate accuracy
    rf.fit(X_batch, y_batch)  # warm_start=True
    print(X_batch.shape)
    print(rf.predict(X_train_split_xgb))
    print(rf.score(X_train_split_xgb, y_train_split_xgb_encoded))
    # pred = rf.predict(X_batch)
    # errors.append(MSE(y_batch, rf.predict(X_batch)))
    rf.n_estimators += 1
Error:

ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210)
Expected: the code runs through and prints the scores at each iteration. Actual: the code stops in the second loop iteration, i.e. the second time the prediction/scoring needs to be done. It stops in rf.predict().
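The mismatched widths (210 vs. 187) are plausibly the number of distinct class labels seen by different fit calls: predict averages per-tree class-probability arrays, and a warm-started forest keeps its old trees, so trees fitted on batches containing different class sets emit arrays with different column counts. A minimal sketch reproducing the same failure on synthetic data (all names here are illustrative, not from the original code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 4, size=200)

rf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
rf.fit(X[y < 3], y[y < 3])   # first "batch" only contains classes {0, 1, 2}
rf.n_estimators += 5
rf.fit(X, y)                 # second "batch" contains all 4 classes

try:
    rf.predict(X)            # old trees emit 3 probability columns, new trees 4
except ValueError as e:
    print(e)                 # operands could not be broadcast together ...
```

Both fit calls succeed; the mismatch only surfaces when predict tries to sum the per-tree probability arrays, which matches the behaviour described above.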
Very late to answer, but leaving a response in case someone else runs into the same issue.

The reason this is happening is that the class labels are not stratified between different calls to the fit method: each batch contains a different subset of the classes, so trees added by later calls disagree with earlier trees about the number of classes. I performed a simple test where I fed the same X and y to the fit method in a loop, and that seems to work.
rf = RandomForestClassifier(warm_start=True)
for _ in range(10):
    X = df.head(100).drop(columns='class')
    y = df.head(100)['class'].values  # same rows as X
    rf.fit(X, y)
rf.score(X_test, y_test)
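A practical way to apply this to the batch training in the question is to make sure every batch contains every class, for example by drawing the batches with StratifiedKFold so the class proportions are preserved in each one. A sketch under that assumption (the data and names are synthetic stand-ins for X_train_split_xgb / y_train_split_xgb_encoded, not the asker's actual get_batches):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data with 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = rng.integers(0, 4, size=2000)

rf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, batch_idx in skf.split(X, y):    # each fold contains all classes
    rf.fit(X[batch_idx], y[batch_idx])  # adds trees up to the current n_estimators
    rf.n_estimators += 5                # grow the forest for the next batch
print(rf.predict(X).shape)              # (2000,) -- predict now works
```

Because every batch exposes the full label set, all trees agree on the number of classes and the probability arrays broadcast cleanly at predict time.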